Big Data Processing Experience in the ATLAS Experiment



Big Data Processing Experience in the ATLAS Experiment
A., on behalf of the ATLAS Collaboration
International Symposium on Grids and Clouds (ISGC) 2014, March 23-28, 2014, Academia Sinica, Taipei, Taiwan

Introduction
To improve the data quality for physics analysis and extend the physics reach, the ATLAS collaboration routinely reprocesses petabytes of data on the Grid. During LHC data taking, we completed three major data reprocessing campaigns, with up to 2 PB of raw data being reprocessed every year. At the time of the conference, the latest data reprocessing campaign of more than 2 PB of 2012 pp data is nearing completion. The demands on Grid computing resources grow, as scheduled LHC upgrades will increase the data taking rates tenfold. Since a tenfold increase in WLCG resources is not an option, a comprehensive model for the composition and execution of the data processing workflow within given CPU and storage constraints is necessary to accommodate the physics needs of the next LHC run. We report on experience gained in ATLAS Big Data processing and on efforts underway to scale up Grid data processing beyond petabytes.

ATLAS Detector
7000 tons, 88 million electronics channels, raw event size ~1 MB. With up to 3 billion events per year, ATLAS records petabytes of LHC collision events.
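As a rough cross-check (not an official ATLAS figure), the numbers on this slide already imply a petabyte-scale raw data volume per year:

```python
# Rough estimate of yearly raw data volume from the figures on this slide.
# Assumes ~1 MB per raw event and up to 3 billion events per year.
events_per_year = 3e9
raw_event_size_mb = 1.0

raw_volume_pb = events_per_year * raw_event_size_mb / 1e9  # 1 PB = 1e9 MB
print(f"~{raw_volume_pb:.1f} PB of raw data per year")      # ~3.0 PB
```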

Detector Data Processing
The starting point for physics analysis is the reconstruction of raw event data from the detector. Applications process raw detector data with sophisticated algorithms to identify and reconstruct physics objects such as charged particle tracks.

Big Data Processing on the Grid
High Energy Physics data are comprised of independent events, and reconstruction applications process one event at a time. One raw file contains events taken within a few minutes; a dataset contains files with events close in time. The first-pass processing of all raw event data at the ATLAS Tier-0 computing site at CERN promptly provides the data for quality assessment and physics analysis. To extend the physics reach, the quality of the reconstructed data is improved by further optimization of software algorithms and conditions/calibrations data. For data processing with improved software and conditions/calibrations (reprocessing), ATLAS uses ten Tier-1 computing sites distributed on the Grid.
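Because events are independent, reconstruction parallelizes naturally at the file level: one job per raw file, with files grouped into datasets by time. The sketch below only illustrates that organization; it is not the ATLAS production software, and all names in it are hypothetical.

```python
# Illustrative sketch of the event -> file -> dataset hierarchy and
# file-level parallelism described above (hypothetical names, not ATLAS code).
from dataclasses import dataclass
from concurrent.futures import ProcessPoolExecutor
from typing import List

@dataclass
class RawFile:
    name: str          # one raw file holds events taken over a few minutes
    n_events: int

@dataclass
class Dataset:
    run_number: int    # a dataset groups files with events close in time
    files: List[RawFile]

def reconstruct_file(raw: RawFile) -> str:
    # Placeholder for the real reconstruction step, which processes
    # one independent event at a time with pattern-recognition algorithms.
    return f"{raw.name}: reconstructed {raw.n_events} events"

def reprocess(dataset: Dataset) -> List[str]:
    # Events (and hence files) are independent, so one job per file
    # can run anywhere on the Grid without inter-job communication.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(reconstruct_file, dataset.files))

if __name__ == "__main__":
    ds = Dataset(run_number=200000,
                 files=[RawFile(f"raw_{i:04d}.data", 5000) for i in range(4)])
    for line in reprocess(ds):
        print(line)
```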

Increasing Big Data Processing Throughput
High throughput is critical for timely completion of the reprocessing campaigns conducted in preparation for major physics conferences. During LHC data-taking, an eight-fold increase in the throughput of Big Data processing was achieved:
2010: 0.9 M jobs processed 1 PB of 2010 data in two months
2011: 1.1 M jobs processed 1 PB of 2011 data in four weeks
2012: 3.5 M jobs processed 2 PB of 2012 data in four weeks
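As a quick cross-check, the eight-fold figure can be reproduced from the campaign numbers above by measuring throughput in jobs per day (taking "two months" as roughly 61 days):

```python
# Rough throughput comparison from the campaign numbers on this slide.
# Durations are approximate: "two months" is taken as 61 days.
campaigns = {
    2010: {"jobs": 0.9e6, "days": 61},   # 1 PB of 2010 data
    2011: {"jobs": 1.1e6, "days": 28},   # 1 PB of 2011 data
    2012: {"jobs": 3.5e6, "days": 28},   # 2 PB of 2012 data
}

rates = {year: c["jobs"] / c["days"] for year, c in campaigns.items()}
for year, rate in rates.items():
    print(f"{year}: ~{rate/1e3:.0f}k jobs/day")

print(f"2012 vs 2010: ~{rates[2012]/rates[2010]:.1f}x")  # ~8.5x, i.e. eight-fold
```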

Big Data Processing Throughput
For faster throughput, the number of jobs running concurrently exceeded 33k during the ATLAS reprocessing campaign in November 2012. For comparison, the daily average number of running jobs remained below 20k during the legacy reprocessing of 2012 pp data conducted by the CMS experiment in January-March 2013 (K. Bloom, CMS Use of a Data Federation, CMS CR-2013/339).

2013 Reprocessing Campaign
To increase ATLAS physics output, the reprocessing makes it possible to find new signatures through subsequent analysis of the LHC Run-1 data, such as searches for heavy, long-lived particles predicted by several SUSY and exotic models. Input data volume: 2.2 PB. Using trigger signatures, ~15% of events are selected in three major physics streams. High throughput is not required in this campaign: the slow-burner schedule requires just 15% of the available resources. Reprocessing status: more than 95% done.

Engineering Reliability
ATLAS data reprocessing on the Grid tolerates a continuous stream of failures, errors and faults. Our experience has shown that Grid failures can occur for a variety of reasons, and Grid heterogeneity makes failures hard to diagnose and repair quickly. While many fault-tolerance mechanisms improve the reliability of data reprocessing on the Grid, their benefits come at a cost. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing; it is not a desirable enhancement but a necessary requirement.

CHEP2012: Costs of Recovery from Failures
Job re-tries avoid data loss at the expense of the CPU time used by the failed jobs. The distribution of tasks(1), ranked by the CPU time used to recover from transient failures, is not uniform: most of the CPU time required for recovery was used in a small fraction of tasks.
[Figure: CPU-hours used to recover from transient failures vs. task rank]
(1) In ATLAS data reprocessing, jobs from the same run are processed in the same task.
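A minimal sketch of the bookkeeping behind this ranking, assuming hypothetical job records rather than the actual production database schema: group jobs by task, sum the CPU-hours burned by failed attempts, and rank tasks by that recovery cost.

```python
# Sketch: rank tasks by CPU-hours spent recovering from transient failures.
# Hypothetical job records; not the real workload-management schema.
from collections import defaultdict

jobs = [
    # (task_id, cpu_hours, status) -- failed attempts are re-tried automatically
    (101, 2.0, "failed"), (101, 2.1, "finished"),
    (101, 1.5, "failed"), (101, 1.6, "finished"),
    (102, 3.0, "finished"),
    (103, 0.4, "failed"), (103, 0.5, "finished"),
]

recovery_cost = defaultdict(float)   # CPU-hours wasted by failed attempts, per task
for task_id, cpu_hours, status in jobs:
    if status == "failed":
        recovery_cost[task_id] += cpu_hours

# Rank tasks by recovery cost: most of the cost concentrates in a few tasks.
ranked = sorted(recovery_cost.items(), key=lambda kv: kv[1], reverse=True)
for rank, (task_id, cost) in enumerate(ranked, start=1):
    print(f"rank {rank}: task {task_id} used {cost:.1f} CPU-hours on failed attempts")
```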

Histogram of Transient Failures Recovery Costs
[Figure: histogram of the number of tasks vs. log10(CPU-hours) used to recover from transient failures, for the 2010 and 2011 campaigns]

Grid2012: Changes in Failure Recovery Costs
The major costs were reduced in 2011; the majority of the costs come from storage failures at the end of a job.
[Figure: histogram of the number of tasks vs. log10(CPU-hours) used for recovery, 2010 compared with 2011]

CHEP2013: Same Behavior in 2012 Reprocessing
There were more tasks in the 2012 reprocessing of 2 PB of 2012 pp data.
[Figure: per-task CPU cost of failure recovery (log scale), comparing the 2010, 2011 and 2012 campaigns]

2013 Reprocessing: Confirms Universal Behavior
[Figure: per-task CPU cost of failure recovery (log scale), now including the 2013 campaign alongside 2010-2012]

CPU-time Used to Recover from Job Failures
[Figure: normalized number of tasks vs. log10(CPU-hours) used to recover from job failures, overlaying the 2010, 2011, 2012 and 2013 campaigns]
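One way to check the log-normal behavior reported in the conclusions is to histogram log10 of the per-task recovery CPU-hours and look at its mean and width; the sketch below does this with synthetic data standing in for the per-campaign task lists, which are not included here.

```python
# Sketch: test whether per-task recovery CPU-hours look log-normal by
# histogramming log10(cpu-hours). Uses synthetic data as a stand-in.
import numpy as np

rng = np.random.default_rng(42)
# Synthetic per-task recovery costs, drawn log-normally for illustration.
recovery_cpu_hours = rng.lognormal(mean=1.0, sigma=1.5, size=5000)

log_cost = np.log10(recovery_cpu_hours)
counts, edges = np.histogram(log_cost, bins=np.arange(-2.5, 5.0, 0.5))

print(f"log10(cpu-hours): mean={log_cost.mean():.2f}, std={log_cost.std():.2f}")
for lo, n in zip(edges[:-1], counts):
    print(f"[{lo:+.1f}, {lo + 0.5:+.1f}) {'#' * (n // 50)}")
```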

Big Data Processing on the Grid: Performance

Reprocessing campaign | Input data volume (PB) | CPU time used for reconstruction (10^6 h) | Fraction of CPU time used for recovery (%)
2010 | 1 | 2.6 | 6.0
2011 | 1 | 3.1 | 4.2
2012 | 2 | 14.6 | 5.6
2013 | 2 | 4.4 | 3.1
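The absolute recovery overhead follows directly from the table; a quick calculation of the CPU-hours spent on re-tries in each campaign:

```python
# CPU-hours spent on failure recovery, derived from the table above
# (reconstruction CPU time in 10^6 hours x recovery fraction).
table = {
    # year: (input PB, reconstruction CPU time [1e6 h], recovery fraction [%])
    2010: (1, 2.6, 6.0),
    2011: (1, 3.1, 4.2),
    2012: (2, 14.6, 5.6),
    2013: (2, 4.4, 3.1),
}

for year, (pb, cpu_mh, frac) in table.items():
    recovery_mh = cpu_mh * frac / 100.0
    print(f"{year}: ~{recovery_mh:.2f} million CPU-hours spent on failure recovery")
```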

Scaling Up Big Data Processing beyond Petabytes
The demands on Grid computing resources grow, as scheduled LHC upgrades will increase ATLAS data taking rates; a comprehensive model for the composition and execution of the data processing workflow within given CPU and storage constraints is necessary to accommodate the physics needs of the next LHC run. Coordinated efforts are underway to scale up Grid data processing beyond petabytes:
Preparing ATLAS Distributed Computing for LHC Run 2
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=160&sessionid=44&confid=513
PanDA's Role in ATLAS Computing Model Evolution
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=162&sessionid=44&confid=513
Integrating Network Awareness in ATLAS Distributed Computing
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=189&sessionid=54&confid=513
Extending ATLAS Computing to Commercial Clouds and Supercomputers
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=191&sessionid=55&confid=513

Conclusions
Reliability Engineering is an active area of research providing solid foundations for the efforts underway to scale up Grid data processing beyond petabytes.
Maximizing throughput: during LHC data-taking, ATLAS achieved an eight-fold increase in the throughput of Big Data processing on the Grid.
Minimizing costs of recovery from transient failures: ATLAS Big Data processing on the Grid keeps the cost of automatic re-tries of failed jobs at the level of 3-6% of the total CPU-hours used for data reconstruction.
Predicting performance: despite substantial differences among all four major ATLAS data reprocessing campaigns on the Grid, we found that the distribution of the CPU-time used to recover from transient job failures exhibits the same general log-normal behavior.
The ATLAS experiment continues optimizing the use of Grid computing resources in preparation for LHC data taking in 2015.

Extra Materials

Increasing Big Data Processing Throughput
2010: 0.9 M jobs to process 1 PB of 2010 data in two months
2011: 1.1 M jobs to process 1 PB of 2011 data in four weeks
2012: 3.5 M jobs to process 2 PB of 2012 data in four weeks
High throughput is critical for timely completion of the reprocessing campaigns conducted in preparation for major physics conferences. In the 2011 reprocessing, the throughput doubled in comparison to the 2010 reprocessing campaign. To deliver new physics results for the 2013 Moriond Conference, ATLAS reprocessed twice as much data in November 2012 within the same time period as in the 2011 reprocessing, even though, due to increased LHC pileup, the 2012 pp events required twice as much time to reconstruct as 2011 events.