Big Data Processing Experience in the ATLAS Experiment
A., on behalf of the ATLAS Collaboration
International Symposium on Grids and Clouds (ISGC) 2014
March 23-28, 2014, Academia Sinica, Taipei, Taiwan
Introduction
To improve the data quality for physics analysis and extend the physics reach, the ATLAS collaboration routinely reprocesses petabytes of data on the Grid.
During LHC data taking, we completed three major data reprocessing campaigns, with up to 2 PB of raw data being reprocessed every year.
At the time of the conference, the latest data reprocessing campaign of more than 2 PB of 2012 pp data is nearing completion.
The demands on Grid computing resources grow, as scheduled LHC upgrades will increase the data taking rates tenfold.
Since a tenfold increase in WLCG resources is not an option, a comprehensive model for the composition and execution of the data processing workflow within given CPU and storage constraints is necessary to accommodate the physics needs of the next LHC run.
We report on experience gained in ATLAS Big Data processing and on efforts underway to scale up Grid data processing beyond petabytes.
ATLAS Detector
7000 tons, 88 million electronics channels, raw event size ~1 MB.
With up to 3 billion events per year, ATLAS records petabytes of LHC collision events.
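As a rough cross-check of these numbers (illustrative arithmetic only, not an official ATLAS figure), multiplying the event count by the raw event size already gives the petabyte scale quoted above:

    # Back-of-the-envelope estimate of the annual raw data volume
    # implied by the numbers above (illustrative only).
    events_per_year = 3e9        # up to 3 billion events per year
    raw_event_size_mb = 1.0      # raw event size ~1 MB

    raw_volume_pb = events_per_year * raw_event_size_mb / 1e9  # 1 PB = 1e9 MB
    print(f"~{raw_volume_pb:.0f} PB of raw data per year")     # ~3 PB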
Detector Data Processing
A starting point for physics analysis is the reconstruction of raw event data from the detector.
Applications process the raw detector data with sophisticated algorithms to identify and reconstruct physics objects such as charged particle tracks.
Big Data Processing on the Grid
High Energy Physics data are comprised of independent events, and reconstruction applications process one event at a time.
One raw file contains events taken within a few minutes; a dataset contains files with events close in time.
The first-pass processing of all raw event data at the ATLAS Tier-0 computing site at CERN promptly provides the data for quality assessment and physics analysis.
To extend the physics reach, the quality of the reconstructed data is improved by further optimization of the software algorithms and the conditions/calibrations data.
For data processing with improved software and conditions/calibrations (reprocessing), ATLAS uses ten Tier-1 computing sites distributed on the Grid.
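The processing model above (independent events grouped into files, files grouped into datasets) can be sketched as follows; the function names and the on-disk format are hypothetical placeholders, not ATLAS software:

    # Minimal sketch of the processing model described above: a dataset is a
    # collection of files, a file is a sequence of independent events, and the
    # reconstruction application processes one event at a time.

    def reconstruct_event(raw_event):
        """Placeholder for the per-event reconstruction algorithms."""
        return {"tracks": [], "raw_size": len(raw_event)}

    def read_events(file_path):
        """Placeholder event iterator; real raw files use a dedicated format."""
        with open(file_path, "rb") as f:
            while chunk := f.read(1_000_000):   # ~1 MB per raw event
                yield chunk

    def process_dataset(file_paths):
        # Because events are independent, files (and events) can be processed
        # in parallel by separate Grid jobs; here we simply loop sequentially.
        for path in file_paths:
            for raw_event in read_events(path):
                yield reconstruct_event(raw_event)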
Increasing Big Data Processing Throughput
High throughput is critical for timely completion of the reprocessing campaigns conducted in preparation for major physics conferences.
During LHC data taking, an eight-fold increase in the throughput of Big Data processing was achieved:
2010: 0.9 M jobs processed 1 PB of 2010 data in two months
2011: 1.1 M jobs processed 1 PB of 2011 data in four weeks
2012: 3.5 M jobs processed 2 PB of 2012 data in four weeks
Big Data Processing Throughput
For faster throughput, the number of jobs running concurrently exceeded 33k during the ATLAS reprocessing campaign in November 2012.
For comparison, the daily average number of running jobs remained below 20k during the legacy reprocessing of 2012 pp data conducted by the CMS experiment in January-March 2013.
K. Bloom, "CMS Use of a Data Federation", CMS CR-2013/339
2013 Reprocessing Campaign
To increase the ATLAS physics output, the reprocessing gives the possibility to find new signatures through subsequent analysis of the LHC Run-1 data, such as searches for heavy, long-lived particles predicted by several SUSY and exotic models.
Input data volume: 2.2 PB
Using trigger signatures, ~15% of events are selected in three major physics streams.
High throughput is not required in this campaign: the slow-burner schedule requires just 15% of the available resources.
Reprocessing status: more than 95% done.
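A minimal sketch of trigger-based event selection as described above; the stream names, trigger names and event layout are purely illustrative assumptions, not the actual ATLAS trigger menu or physics streams:

    # Illustrative sketch of selecting a subset of events by trigger signature
    # into physics streams. Trigger names and the event model are hypothetical;
    # real ATLAS selections use the trigger decision stored with each event.

    STREAM_TRIGGERS = {
        "stream_A": {"TRIG_X"},
        "stream_B": {"TRIG_Y"},
        "stream_C": {"TRIG_Z"},
    }

    def select_streams(events):
        """Assign each event to the streams whose triggers it fired."""
        streams = {name: [] for name in STREAM_TRIGGERS}
        for event in events:
            fired = set(event["fired_triggers"])
            for name, triggers in STREAM_TRIGGERS.items():
                if fired & triggers:
                    streams[name].append(event)
        return streams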
Engineering Reliability
ATLAS data reprocessing on the Grid tolerates a continuous stream of failures, errors and faults.
Our experience has shown that Grid failures can occur for a variety of reasons, and Grid heterogeneity makes failures hard to diagnose and repair quickly.
While many fault-tolerance mechanisms improve the reliability of data reprocessing on the Grid, their benefits come at a cost.
Reliability Engineering provides a framework for a fundamental understanding of Big Data processing; it is not a desirable enhancement but a necessary requirement.
CHEP2012: Costs of Recovery from Failures
Job re-tries avoid data loss at the expense of the CPU time used by the failed jobs.
The distribution of tasks (1) ranked by the CPU time used to recover from transient failures is not uniform: most of the CPU time required for recovery was used in a small fraction of tasks.
[Figure: CPU-hours used to recover from transient failures vs. task rank]
(1) In ATLAS data reprocessing, jobs from the same run are processed in the same task.
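The ranking behind this plot can be sketched as follows, assuming hypothetical job accounting records with a task identifier, a final status, and the CPU-hours consumed; this is not the actual ATLAS/PanDA bookkeeping code:

    # Sketch of the bookkeeping behind the plot described above: sum the CPU
    # hours consumed by failed (later retried) jobs per task, then rank tasks
    # by that recovery cost. The job-record fields are hypothetical.

    from collections import defaultdict

    def recovery_cost_by_task(job_records):
        """job_records: iterable of dicts with 'task_id', 'status', 'cpu_hours'."""
        cost = defaultdict(float)
        for job in job_records:
            if job["status"] == "failed":     # transient failure, job was retried
                cost[job["task_id"]] += job["cpu_hours"]
        # Rank tasks from the most to the least expensive recovery
        return sorted(cost.items(), key=lambda kv: kv[1], reverse=True)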
Histogram of Transient Failures Recovery Costs
[Figure: number of tasks vs. log10(CPU-hours) used to recover from transient failures, for the 2010 and 2011 campaigns]
Grid2012: Changes in Failure Recovery Costs
[Figure: number of tasks vs. log10(CPU-hours) used for recovery, 2010 vs. 2011]
The major costs were reduced in 2011.
The majority of the costs come from storage failures at the end of a job.
CHEP2013: Same Behavior in 2012 Reprocessing
[Figure: CPU-hours used to recover from transient failures vs. task rank, for the 2010, 2011 and 2012 campaigns]
There were more tasks in the 2012 reprocessing of 2 PB of 2012 pp data.
2013 Reprocessing: Confirms Universal Behavior
[Figure: CPU-hours used to recover from transient failures vs. task rank, now including the 2013 campaign]
CPU-time Used to Recover from Job Failures
[Figure: normalized number of tasks vs. log10(CPU-hours) used for recovery, for the 2010-2013 campaigns]
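Since the conclusions describe this distribution as log-normal, one way to quantify it is to fit the mean and width of log10(CPU-hours) per task; a minimal sketch, assuming the per-task recovery costs are available as a plain list (hypothetical input, not the ATLAS analysis code):

    # Sketch of checking the log-normal behaviour reported for the per-task
    # recovery cost: if CPU-hours are log-normally distributed, their log10
    # values follow a normal distribution.

    import numpy as np

    def lognormal_fit(recovery_cpu_hours):
        """Fit mean and width of log10(CPU-hours) across tasks."""
        logs = np.log10(np.asarray(recovery_cpu_hours, dtype=float))
        return logs.mean(), logs.std(ddof=1)

    # Example use: compare campaigns by the fitted parameters
    # mu_2012, sigma_2012 = lognormal_fit(costs_2012)  # costs_2012: CPU-hours per task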
Big Data Processing on the Grid: Performance

Reprocessing  Input Data    CPU Time Used for          Fraction of CPU Time
campaign      Volume (PB)   Reconstruction (10^6 h)    Used for Recovery (%)
2010          1             2.6                        6.0
2011          1             3.1                        4.2
2012          2             14.6                       5.6
2013          2             4.4                        3.1
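For illustration, the recovery fractions in the table can be converted into absolute CPU-hours; a back-of-the-envelope sketch using only the numbers above:

    # Back-of-the-envelope conversion of the table above into absolute CPU-hours
    # spent on recovery (fraction of the reconstruction CPU time), illustrative only.

    campaigns = {
        # year: (reconstruction CPU time in 1e6 hours, recovery fraction in %)
        2010: (2.6, 6.0),
        2011: (3.1, 4.2),
        2012: (14.6, 5.6),
        2013: (4.4, 3.1),
    }

    for year, (cpu_mh, frac_pct) in campaigns.items():
        recovery_mh = cpu_mh * frac_pct / 100.0
        print(f"{year}: ~{recovery_mh:.2f} million CPU-hours used for recovery")
    # e.g. 2012: ~0.82 million CPU-hours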
Scaling Up Big Data Processing beyond Petabytes
The demands on Grid computing resources grow, as scheduled LHC upgrades will increase ATLAS data taking rates.
A comprehensive model for the composition and execution of the data processing workflow within given CPU and storage constraints is necessary to accommodate the physics needs of the next LHC run.
Coordinated efforts are underway to scale up Grid data processing beyond petabytes:
Preparing ATLAS Distributed Computing for LHC Run 2
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=160&sessionid=44&confid=513
PanDA's Role in ATLAS Computing Model Evolution
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=162&sessionid=44&confid=513
Integrating Network Awareness in ATLAS Distributed Computing
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=189&sessionid=54&confid=513
Extending ATLAS Computing to Commercial Clouds and Supercomputers
http://indico3.twgrid.org/indico/contributiondisplay.py?contribid=191&sessionid=55&confid=513
Conclusions
Reliability Engineering is an active area of research providing solid foundations for the efforts underway to scale up Grid data processing beyond petabytes.
Maximizing throughput: during LHC data taking, ATLAS achieved an eight-fold increase in the throughput of Big Data processing on the Grid.
Minimizing costs of recovery from transient failures: ATLAS Big Data processing on the Grid keeps the cost of automatic re-tries of failed jobs at the level of 3-6% of the total CPU-hours used for data reconstruction.
Predicting performance: despite substantial differences among all four major ATLAS data reprocessing campaigns on the Grid, we found that the distribution of the CPU time used to recover from transient job failures exhibits the same general log-normal behavior.
The ATLAS experiment continues optimizing the use of Grid computing resources in preparation for LHC data taking in 2015.
Extra Materials
Increasing Big Data Processing Throughput
[Figure: 2010: 0.9 M jobs to process 1 PB of 2010 data in two months; 2011: 1.1 M jobs to process 1 PB of 2011 data in four weeks; 2012: 3.5 M jobs to process 2 PB of 2012 data in four weeks]
High throughput is critical for timely completion of the reprocessing campaigns conducted in preparation for major physics conferences.
In the 2011 reprocessing, the throughput doubled in comparison to the 2010 reprocessing campaign.
To deliver new physics results for the 2013 Moriond Conference, ATLAS reprocessed twice as much data in November 2012 within the same time period as in the 2011 reprocessing, while, due to the increased LHC pileup, the 2012 pp events required twice as much time to reconstruct as the 2011 events.
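These three factors of two are one way to read the eight-fold increase quoted earlier; a minimal arithmetic check (an interpretation of the statements above, not an official breakdown):

    # The eight-fold throughput increase quoted on the earlier slide, read as the
    # product of the three factors stated above (illustrative arithmetic only).

    gain_2011_vs_2010 = 2.0   # 2011 throughput doubled with respect to 2010
    data_2012_vs_2011 = 2.0   # twice as much data reprocessed in the same time period
    cpu_per_event_2012 = 2.0  # 2012 events took twice as long to reconstruct

    total_gain = gain_2011_vs_2010 * data_2012_vs_2011 * cpu_per_event_2012
    print(f"Overall throughput increase: {total_gain:.0f}x")   # 8x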