Advancements in Big Data Processing
1 Advancements in Big Data Processing Alexandre Vaniachine V International Conference "Distributed Computing and Grid-technologies in Science and Education" (Grid2012) July 16-21, 2012, JINR, Dubna, Russia
2 Big Data and the Grid The forthcoming conference will be the fifth one organized by the JINR Laboratory of Information Technologies. Distributed computing and grid technologies have gone beyond academic interest since then. Present-day grid technologies make it possible to integrate computing centers located anywhere worldwide and to provide these computing resources to users. Present-day high technologies call for the development of high-performance computing. Activities involving huge information throughputs now encompass not only various areas of natural science but also business and industry. Over the last few years the deployed distributed computing grid infrastructure has enabled the solution of a wide class of tasks that require the processing of enormous amounts of data.
3 Impressive Conference Series
4 Fourth Paradigm In a speech given just a few weeks before he was lost at sea off the California coast in January 2007, Jim Gray, a database software pioneer and a Microsoft researcher, sketched out an argument that an exaflood of scientific data is fundamentally transforming the practice of science. Dr. Gray called the shift a fourth paradigm. Invited talk at the Grid2010 Conference, Dubna, Russia
5 Big Data The ever-increasing volumes of scientific data present new challenges to distributed computing and Grid technologies. The emerging Big Data revolution drives new discoveries in scientific fields including nanotechnology, astrophysics, high-energy physics, biology and medicine. New initiatives are transforming data-driven scientific fields, enabling massive data analysis in new ways. To address the Big Data processing challenge, the LHC experiments rely on the computational grid infrastructure deployed in the framework of the Worldwide LHC Computing Grid. Following Big Data processing on the Grid, more than 8000 scientists analyze LHC data in search of discoveries.
6 Historic Milestone Expected from a SM Higgs at the given mH. Global Effort, Global Success: results today were only possible due to the extraordinary performance of the accelerators, the experiments and Grid computing. Observation of a new particle consistent with a Higgs boson (but which one?). A historic milestone, but only the beginning: global implications for the future.
7 Big Data Processing for Discoveries J. Incandela for the CMS Collaboration, "The Status of the Higgs Search", July 4th 2012: it would have been impossible to release physics results so quickly without the outstanding performance of the Grid (including the CERN Tier-0). Figure: CMS GEN-SIM/AODSIM production per month, and the number of concurrent ATLAS jobs Jan-July 2012, which includes MC production, user and group analysis at CERN, 10 Tier-1s and ~70 Tier-2 federations at more than 80 sites. More than 1500 distinct ATLAS users do analysis on the Grid. Available resources are fully used and stressed (beyond pledges in some cases); massive production of 8 TeV Monte Carlo samples. A very effective and flexible computing model and operation team accommodate high trigger rates and pile-up, intense MC simulation, and analysis demands from worldwide users (through e.g. dynamic data placement). Photo: Denis Balibouse, Pool/AP
8 Genesis Big Data experiments at the LHC study the quark-hadron transition (ALICE) and the electroweak phase transition. Baryon number was violated earlier; the mechanism is CP violation, A.D. Sakharov (1967). CP violation in the Standard Model is too small for baryogenesis, which indicates new physics. Studies of CP violation and baryogenesis beyond the LHC reach require Big Data.
9 Future Big Data Experiments High rate scenario for the Belle II DAQ, compared with the LHC experiments (LCG TDR, 2005*):

Experiment   Event Size [kB]   Rate [Hz]   Rate [MB/s]
Belle II            300          6,000        1,800
ALICE (HI)       12,500            100        1,250
ALICE (pp)        1,000            100          100
ATLAS             1,600            200          320
CMS               1,500            150          225
LHCb                 25          2,000           50

* The LHC experiments are running at a factor of two or higher event rates. The Belle II raw event size is about 300 kbyte; two copies of the raw data are stored on tape. Limited precision due to many assumptions. 5 July 2012, Martin Sevior, ICHEP, Melbourne 2012
10 Big Data Processing in Future Experiments Belle II Computing Model: Grid-based distributed computing with a common framework for DAQ and offline (basf2), based on ROOT I/O. Raw data storage and processing from the Belle II detector; MC production and ntuple production on the Grid; optional MC production and ntuple analysis at local resources. Zdeněk Doležal, SuperB Collaboration Meeting
11 Six Sigma Quality of Industrial Production In industry, Six Sigma analysis improves the quality of production by identifying and removing the causes of defects. A Six Sigma process is one in which products are essentially free of defects. In practice, an industrial Six Sigma process corresponds to 4.5 sigma, taking into account the 1.5-sigma shift from variations in the production process. In contrast, Big Data processing requires six sigma quality in the true mathematical sense: a 10^-8 level of defects.
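The defect levels above follow from the Gaussian tail probability. A minimal sketch (an illustration added here, not from the talk) using only the standard library:

```python
import math

def defect_rate(sigma_level):
    """One-sided Gaussian tail probability beyond sigma_level standard deviations."""
    return 0.5 * math.erfc(sigma_level / math.sqrt(2))

# True mathematical six sigma: well below the 10^-8 defect level
print(f"6.0 sigma: {defect_rate(6.0):.1e}")   # ~9.9e-10

# Industrial Six Sigma after the 1.5-sigma shift is effectively 4.5 sigma
print(f"4.5 sigma: {defect_rate(4.5):.1e}")   # ~3.4e-06, i.e. 3.4 defects per million
```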
12 Why Six Sigma? Online data processing applies a 10^-5 event selection; offline data processing applies a 10^-9 event selection; more than 1 PB is recorded per year. Figure 4-1: cross-sections and rates for various processes (σ_tot, σ_b, σ_jet, σ_W, σ_Z, σ_t, σ_Higgs) in proton collisions as a function of √s, from the Tevatron to the LHC. Physics discovery requires six sigma quality during Big Data processing.
13 Golden H→4l Channel: One Event per Billion LHC physics discovery requires Big Data processing errors below the 10^-8 level.
14 Six Sigma Quality Failures recovered by re-tries achieve production quality in Big Data processing on the Grid at the six sigma level. No events were lost during the main ATLAS reprocessing campaign of the 2010 data, which reconstructed more than 1 PB of data on the Grid. In the last 2011 data reprocessing, only two collision events of the total could not be reconstructed; these events were reprocessed later in a dedicated data recovery step. Later, silent data corruption was detected in six events from the reprocessed 2010 data and in one case of five adjacent events from the reprocessed 2011 data. Corresponding to event losses below the 10^-8 level, this demonstrates sustained six sigma quality performance in Big Data processing. CMS keeps event losses below the 10^-8 level in the first-pass processing (not Grid).
15 Task Processing: ATLAS vs. CMS In Big Data processing scientists deal with datasets, not individual files. Thus a task, not a job, is the major unit of Big Data processing on the Grid. Figure 1 (workload management architecture): requests are defined into tagged tasks, and DEfT (Dynamic Evolution for Tasks) handles task evolution: recovery, speed-up and force-finish, forking and extensions, transient and intermediate tasks, autosplitting.
16 A Powerful Analogy Splitting a large data processing task into jobs is similar to the splitting of a large file into smaller TCP/IP packets during an FTP data transfer. Splitting data into smaller pieces achieves reliability through re-sending of the dropped TCP/IP packets; likewise, in Big Data processing transient job failures are recovered by re-tries. In file transfer the TCP/IP packet is the unit of checkpointing; in HEP Big Data processing the checkpointing unit is a job (PanDA) or a file (DIRAC). Unlike many tightly-coupled high-performance computing problems, which require inter-process communications to be parallelized, high energy physics Big Data processing is often called embarrassingly parallel, since each event is independent. However, event-level checkpointing is rarely used today, as its granularity is too small for Big Data processing in high energy physics. Next generation systems need event-level checkpointing for Big Data (see the PanDA talk by K. De).
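The job-retry mechanism described above can be sketched in a few lines. The names and the 10% transient-failure rate below are hypothetical, chosen only to illustrate how per-job checkpointing with re-tries recovers a task:

```python
import random

def run_job(job_id):
    """Stand-in for a Grid job; suffers a transient failure ~10% of the time."""
    return random.random() > 0.1   # True on success

def process_task(input_files, max_retries=3):
    """Split a task into one job per input file and re-try failed jobs,
    much as TCP/IP re-sends dropped packets during a file transfer."""
    lost = []
    for job_id in input_files:
        for _attempt in range(1 + max_retries):
            if run_job(job_id):
                break              # checkpoint: this job/file is done
        else:
            lost.append(job_id)    # still failing after all re-tries
    return lost

failed = process_task(range(1000))
print(f"jobs lost after re-tries: {len(failed)}")
```

With four attempts the per-job loss probability drops from 10^-1 to 10^-4, so a 1000-job task usually completes with no losses at all.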
17 Reliability Engineering for Big Data on the Grid LHC experience has shown that Grid failures can occur for a variety of reasons, and Grid heterogeneity makes failures hard to diagnose and repair quickly. Big Data processing on the Grid must tolerate a continuous stream of failures, errors and faults. While many fault-tolerance mechanisms improve the reliability of Big Data processing on the Grid, their benefits come at a cost. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing on the Grid, for which reliability is not a desirable enhancement but a necessary requirement.
18 Jim Gray Hypothesis Heisenbugs are bugs which may or may not cause a fault for a given operation. If a Heisenbug is present in the system, the error can be removed by retrying the operation. Heisenbugs are also called transient or intermittent faults. Bohrbugs are bugs which always cause a failure when a particular operation is performed. In the presence of Bohrbugs, there will always be a failure on retrying the operation that caused it. Bohrbugs are also called permanent faults.
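The distinction can be shown with a toy sketch (hypothetical names and failure rates): re-tries recover a Heisenbug but never a Bohrbug:

```python
import random

def heisenbug_op():
    """Transient fault: each retry has an independent chance to succeed."""
    if random.random() < 0.5:
        raise RuntimeError("transient fault")
    return "ok"

def bohrbug_op():
    """Permanent fault: the operation fails deterministically every time."""
    raise RuntimeError("permanent fault")

def with_retries(operation, tries=10):
    """Return the operation's result, or 'failed' if every retry fails."""
    for _ in range(tries):
        try:
            return operation()
        except RuntimeError:
            continue
    return "failed"

print(with_retries(heisenbug_op))  # almost always "ok" (loss odds 0.5**10)
print(with_retries(bohrbug_op))    # always "failed": retrying cannot help
```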
19 CMS Failure Recovery Cost From CMS conference report CR-2012/070, "A new era for central processing and production in CMS", Edgar Mauricio Fajardo Hernandez for the CMS Collaboration, 10 May 2012, available on the CMS information server; invited talk at the Grid2012 Conference, Dubna, Russia.

Abstract: The goal for CMS computing is to maximise the throughput of simulated event generation while also processing the real data events as quickly and reliably as possible. To maintain this achievement as the quantity of events increases, since the beginning of 2011 CMS computing has migrated at the Tier-1 level from its old production framework, ProdAgent, to a new one, WMAgent. The WMAgent framework offers improved processing efficiency and increased resource usage as well as a reduction in manpower. In addition to the challenges encountered during the design of the WMAgent framework, several operational issues have arisen during its commissioning. The largest operational challenges were in the usage and monitoring of resources, mainly a result of a change in the way work is allocated: instead of work being assigned to operators, all work is centrally injected and managed in the Request Manager system, and the task of the operators has changed from running individual workflows to monitoring the global workload. In this report we present how we tackled some of the operational challenges, and how we benefitted from the lessons learned in the commissioning of the WMAgent framework at the Tier-2 level. As case studies, we show how the WMAgent system performed during some of the large data reprocessing and Monte Carlo simulation campaigns.

Component recovery: if a component is marked as down, the operator logs in to the machine, checks the logs for the given component, restarts it, and makes an entry in the team e-log about the incident. If the problem persists, one of the team leaders checks and diagnoses the situation and, if needed, contacts the WMAgent support (the developers). But keeping up the infrastructure is not limited to just restarting components: it also involves checking the load on the machines and, if necessary, cleaning up disk space. An alert system is being developed to further automate this task; it will soon be commissioned for production [9].

Debugging site problems: given the distributed nature of the CMS computing model, the debugging and monitoring of site-specific problems becomes a communication challenge. Site problems can be divided into two types: jobs cannot enter a site, or jobs can run but fail. The first is solved by communicating the problem to the GlideIn factory support and debugging it with them. The second involves monitoring the failed jobs in the Global Monitor together with the information in the Dashboard. Once the nature of the problem is found (a site problem or a configuration problem) it is propagated accordingly: to the site through the Savannah and GGUS ticket systems [11] in the first case, or to the requestor in the latter. The request is then either aborted (user configuration problems) or the submission to the site is stopped until the problem is understood and solved.

Closing-out requests: once a request is finished, all jobs have either succeeded or failed. At this moment the operator checks that the event count of the request is higher than 95% of the requested events for MC production and reprocessing, and 100% for real data. If it is lower, another request must be created to recover the failed jobs; this task normally involves communication with the coordinators. Once the event count is correct, the transfer subscription (to the corresponding custodial Tier-1 site) is monitored to reach 100%. Once the operator agrees that the event count and transfers are fine, he sets the status of the request in the ReqMgr as closed-out.

Other tasks, mainly done by the team leaders, consist of working tightly with the Integration team to commission new sites for MC production and testing new versions of ReqMgr, Global WorkQueue and WMAgent.
20 ATLAS Failure Recovery Cost Job resubmission avoids data loss at the expense of the CPU time used by the failed jobs. The distribution of tasks ordered by the CPU time used to recover transient failures is not uniform: most of the CPU time required for recovery was used in a small fraction of tasks. In the 2010 reprocessing, the CPU time used to recover transient failures was 6% of the CPU time used for reconstruction; in the 2011 reprocessing it was reduced to 4%. Figure: CPU-hours used to recover transient failures vs. task order. ATLAS Production System, Ljubljana, June 20, 2012
21 Waloddi Weibull 1939: Weibull published his paper on the Weibull distribution in probability theory and statistics. 1941: Weibull received a personal research professorship in Technical Physics at the Royal Institute of Technology in Stockholm from the arms producer Bofors. 1951: Weibull presented his most famous paper on the Weibull distribution, using seven case studies, to the American Society of Mechanical Engineers (ASME). 1972: the ASME awarded Weibull the Gold Medal. 1978: the Royal Swedish Academy of Engineering Sciences awarded Weibull the Great Gold Medal. The Weibull distribution is by far the world's most popular statistical model for production data.
22 Bi-modal Weibull Distribution: CPU Time Used to Recover Job Failures in ATLAS Data Reprocessing The majority of costs come from failures at the end of a job; the 2011 improvements reduced the major costs. Figure: number of tasks vs. log10(CPU-hours).
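A bi-modal Weibull can be modelled as a two-component mixture. The sketch below uses purely illustrative parameters (not the fitted ATLAS values) to show how a minority mode of expensive end-of-job failures can dominate the total recovery cost:

```python
import random

def bimodal_weibull_sample(n, p_cheap=0.7,
                           scale_cheap=2.0, shape_cheap=0.8,     # quick, early failures
                           scale_costly=50.0, shape_costly=3.0):  # failures near job end
    """Draw n recovery costs (CPU-hours) from a two-component Weibull mixture."""
    costs = []
    for _ in range(n):
        if random.random() < p_cheap:
            costs.append(random.weibullvariate(scale_cheap, shape_cheap))
        else:
            costs.append(random.weibullvariate(scale_costly, shape_costly))
    return costs

costs = bimodal_weibull_sample(10_000)
total = sum(costs)
costly_tail = sum(c for c in costs if c > 20)
print(f"share of recovery cost above 20 CPU-hours: {costly_tail / total:.0%}")
```

With these assumed parameters the 30% of failures in the costly mode account for the bulk of the recovered CPU time, mirroring the observation that most recovery cost is concentrated in a small fraction of tasks.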
23 Time Overhead It takes about three million core-hours to process one petabyte of ATLAS data. Transient job failures and retries delay the reprocessing duration. Optimization of the ATLAS Big Data processing workflow and other improvements cut the tails and halved the duration of petabyte-scale reprocessing on the Grid, from almost two months in 2010 to less than four weeks in 2011. Optimization of fault-tolerance techniques vs. the time overhead ("tails") induced in task completion is an active area of research in Big Data processing on the Grid. See the presentation by D. Golubkov in the ATLAS Computing Workshop session today.
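Combining the three-million-core-hours-per-petabyte figure with the 6% (2010) and 4% (2011) recovery fractions quoted earlier gives a back-of-the-envelope estimate of the retry overhead (a rough illustration, assuming the fractions apply to the full processing cost):

```python
CORE_HOURS_PER_PB = 3_000_000  # ~3M core-hours to process 1 PB of ATLAS data

for year, retry_fraction in [(2010, 0.06), (2011, 0.04)]:
    overhead = CORE_HOURS_PER_PB * retry_fraction
    # 2010: ~180,000 core-hours per PB; 2011: ~120,000 core-hours per PB
    print(f"{year}: ~{overhead:,.0f} core-hours per PB spent recovering failures")
```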
24 Conclusions The emerging Big Data revolution drives new discoveries in scientific fields including nanotechnology, astrophysics, high-energy physics, biology and medicine. In Big Data processing scientists deal with datasets, not individual files; a task (comprising many jobs) became the unit of Big Data processing on the Grid. Splitting a Big Data processing task into jobs enables fine-granularity checkpointing at the level of a single job/file. Fault-tolerance achieved through automatic re-tries of failed jobs induces a time overhead in task completion which is difficult to predict. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing on the Grid, for which reliability is not a desirable enhancement but a necessary requirement. The inter-relationship of fault-tolerance techniques and performance prediction is an active area of research in Big Data processing on the Grid.
25 Extra
26 Slide from ICHEP: Which Experiments use Big Data? Progress: Scale of the Solution Progress in distributed computing and evolution of computing capacity: WLCG processes 1.5M jobs on the grid per day, with more than 150 PB of both disk and tape. Ian Fisk, FNAL/CD
27 Big Data Processing in LHC Experiments Big Data processing on the Grid requires a balance between CPU performance and storage. Pushing Big Data limits, LHC computing evolves from the hierarchy to the mesh. IEEE NSS/MIC, October 24
28 Tier-0 First Pass Data Processing Minimizing event losses, LHC experiments implemented multi-pass processing for Big Data: first-pass data processing at Tier-0, and reprocessing at Tier-1 on the Grid. CMS (right) reported first-pass event losses at the 10^-8 level; ATLAS (left) tolerates event losses at the 10^-5 level during first-pass processing. Lost events are recovered during reprocessing.
ADVANCEMENTS IN BIG DATA PROCESSING IN THE ATLAS AND CMS EXPERIMENTS
A.V. Vaniachine on behalf of the ATLAS and CMS Collaborations
Argonne National Laboratory, 9700 S Cass Ave, Argonne, IL, 60439, USA
More informationMission. To provide higher technological educa5on with quality, preparing. competent professionals, with sound founda5ons in science, technology
Mission To provide higher technological educa5on with quality, preparing competent professionals, with sound founda5ons in science, technology and innova5on, commi
More informationAugust 2009. Transforming your Information Infrastructure with IBM s Storage Cloud Solution
August 2009 Transforming your Information Infrastructure with IBM s Storage Cloud Solution Page 2 Table of Contents Executive summary... 3 Introduction... 4 A Story or three for inspiration... 6 Oops,
More informationCMS Software Deployment on OSG
CMS Software Deployment on OSG Bockjoo Kim 1, Michael Thomas 2, Paul Avery 1, Frank Wuerthwein 3 1. University of Florida, Gainesville, FL 32611, USA 2. California Institute of Technology, Pasadena, CA
More informationTier0 plans and security and backup policy proposals
Tier0 plans and security and backup policy proposals, CERN IT-PSS CERN - IT Outline Service operational aspects Hardware set-up in 2007 Replication set-up Test plan Backup and security policies CERN Oracle
More informationScheduling and Load Balancing in the Parallel ROOT Facility (PROOF)
Scheduling and Load Balancing in the Parallel ROOT Facility (PROOF) Gerardo Ganis CERN E-mail: Gerardo.Ganis@cern.ch CERN Institute of Informatics, University of Warsaw E-mail: Jan.Iwaszkiewicz@cern.ch
More informationStatus of Grid Activities in Pakistan. FAWAD SAEED National Centre For Physics, Pakistan
Status of Grid Activities in Pakistan FAWAD SAEED National Centre For Physics, Pakistan 1 Introduction of NCP-LCG2 q NCP-LCG2 is the only Tier-2 centre in Pakistan for Worldwide LHC computing Grid (WLCG).
More informationBENCHMARKING V ISUALIZATION TOOL
Copyright 2014 Splunk Inc. BENCHMARKING V ISUALIZATION TOOL J. Green Computer Scien
More informationTop-Quark Studies at CMS
Top-Quark Studies at CMS Tim Christiansen (CERN) on behalf of the CMS Collaboration ICHEP 2010, Paris 35th International Conference on High-Energy Physics tt 2 km 22 28 July 2010 Single-top 4 km New Physics
More informationTheoretical Particle Physics FYTN04: Oral Exam Questions, version ht15
Theoretical Particle Physics FYTN04: Oral Exam Questions, version ht15 Examples of The questions are roughly ordered by chapter but are often connected across the different chapters. Ordering is as in
More informationNERSC Archival Storage: Best Practices
NERSC Archival Storage: Best Practices Lisa Gerhardt! NERSC User Services! Nick Balthaser! NERSC Storage Systems! Joint Facilities User Forum on Data Intensive Computing! June 18, 2014 Agenda Introduc)on
More informationScalable Multi-Node Event Logging System for Ba Bar
A New Scalable Multi-Node Event Logging System for BaBar James A. Hamilton Steffen Luitz For the BaBar Computing Group Original Structure Raw Data Processing Level 3 Trigger Mirror Detector Electronics
More informationAn Integrated CyberSecurity Approach for HEP Grids. Workshop Report. http://hpcrd.lbl.gov/hepcybersecurity/
An Integrated CyberSecurity Approach for HEP Grids Workshop Report http://hpcrd.lbl.gov/hepcybersecurity/ 1. Introduction The CMS and ATLAS experiments at the Large Hadron Collider (LHC) being built at
More informationDesigning a Cloud Storage System
Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes
More informationTier-1 Services for Tier-2 Regional Centres
Tier-1 Services for Tier-2 Regional Centres The LHC Computing MoU is currently being elaborated by a dedicated Task Force. This will cover at least the services that Tier-0 (T0) and Tier-1 centres (T1)
More informationPHYSICS WITH LHC EARLY DATA
PHYSICS WITH LHC EARLY DATA ONE OF THE LAST PROPHETIC TALKS ON THIS SUBJECT HOPEFULLY We may have some two month of the Machine operation in 2008 LONG HISTORY... I will extensively use: Fabiola GIANOTTI
More informationBeyond High Performance Computing: What Matters to CERN
Beyond High Performance Computing: What Matters to CERN Pierre VANDE VYVRE for the ALICE Collaboration ALICE Data Acquisition Project Leader CERN, Geneva, Switzerland 2 CERN CERN is the world's largest
More informationAccelerating Experimental Elementary Particle Physics with the Gordon Supercomputer. Frank Würthwein Rick Wagner August 5th, 2013
Accelerating Experimental Elementary Particle Physics with the Gordon Supercomputer Frank Würthwein Rick Wagner August 5th, 2013 The Universe is a strange place! 67% of energy is dark energy We got no
More informationOnline Performance Monitoring of the Third ALICE Data Challenge (ADC III)
EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH European Laboratory for Particle Physics Publication ALICE reference number ALICE-PUB-1- version 1. Institute reference number Date of last change 1-1-17 Online
More informationDistributed Computing for CEPC. YAN Tian On Behalf of Distributed Computing Group, CC, IHEP for 4 th CEPC Collaboration Meeting, Sep.
Distributed Computing for CEPC YAN Tian On Behalf of Distributed Computing Group, CC, IHEP for 4 th CEPC Collaboration Meeting, Sep. 12-13, 2014 1 Outline Introduction Experience of BES-DIRAC Distributed
More informationFCC 1309180800 JGU WBS_v0034.xlsm
1 Accelerators 1.1 Hadron injectors 1.1.1 Overall design parameters 1.1.1.1 Performance and gap of existing injector chain 1.1.1.2 Performance and gap of existing injector chain 1.1.1.3 Baseline parameters
More informationDatabase Monitoring Requirements. Salvatore Di Guida (CERN) On behalf of the CMS DB group
Database Monitoring Requirements Salvatore Di Guida (CERN) On behalf of the CMS DB group Outline CMS Database infrastructure and data flow. Data access patterns. Requirements coming from the hardware and
More informationThe Compact Muon Solenoid Experiment. Conference Report. Mailing address: CMS CERN, CH-1211 GENEVA 23, Switzerland
Available on CMS information server CMS CR -2012/114 The Compact Muon Solenoid Experiment Conference Report Mailing address: CMS CERN, CH-1211 GENEVA 23, Switzerland 23 May 2012 CMS Data Transfer operations
More informationBig Data + Big Analytics Transforming the way you do business
Big Data + Big Analytics Transforming the way you do business Bryan Harris Chief Technology Officer VSTI A SAS Company 1 AGENDA Lets get Real Beyond the Buzzwords Who is SAS? Our PerspecDve of Big Data
More informationData Management in the Cloud: Limitations and Opportunities. Annies Ductan
Data Management in the Cloud: Limitations and Opportunities Annies Ductan Discussion Outline: Introduc)on Overview Vision of Cloud Compu8ng Managing Data in The Cloud Cloud Characteris8cs Data Management
More informationATLAS job monitoring in the Dashboard Framework
ATLAS job monitoring in the Dashboard Framework J Andreeva 1, S Campana 1, E Karavakis 1, L Kokoszkiewicz 1, P Saiz 1, L Sargsyan 2, J Schovancova 3, D Tuckett 1 on behalf of the ATLAS Collaboration 1
More informationRoadmap for Applying Hadoop Distributed File System in Scientific Grid Computing
Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing Garhan Attebury 1, Andrew Baranovski 2, Ken Bloom 1, Brian Bockelman 1, Dorian Kcira 3, James Letts 4, Tanya Levshina 2,
More informationBig Data Analytics. for the Exploitation of the CERN Accelerator Complex. Antonio Romero Marín
Big Data Analytics for the Exploitation of the CERN Accelerator Complex Antonio Romero Marín Milan 11/03/2015 Oracle Big Data and Analytics @ Work 1 What is CERN CERN - European Laboratory for Particle
More informationLeveraging OpenStack Private Clouds
Leveraging OpenStack Private Clouds Robert Ronan Sr. Cloud Solutions Architect! Robert.Ronan@rackspace.com! LEVERAGING OPENSTACK - AGENDA OpenStack What is it? Benefits Leveraging OpenStack University
More informationBig Data : The New Oil
Big Data : The New Oil Data Driven Innova7on: $ 15 billion by 2015 Data Driven Innova,on for Growth and Well Being: OECD Report: Interim Synthesis Report 2014 The new data- driven era of scien7fic discovery
More informationPerformance Monitoring of the Software Frameworks for LHC Experiments
Proceedings of the First EELA-2 Conference R. mayo et al. (Eds.) CIEMAT 2009 2009 The authors. All rights reserved Performance Monitoring of the Software Frameworks for LHC Experiments William A. Romero
More informationClusters in the Cloud
Clusters in the Cloud Dr. Paul Coddington, Deputy Director Dr. Shunde Zhang, Compu:ng Specialist eresearch SA October 2014 Use Cases Make the cloud easier to use for compute jobs Par:cularly for users
More informationRunning a typical ROOT HEP analysis on Hadoop/MapReduce. Stefano Alberto Russo Michele Pinamonti Marina Cobal
Running a typical ROOT HEP analysis on Hadoop/MapReduce Stefano Alberto Russo Michele Pinamonti Marina Cobal CHEP 2013 Amsterdam 14-18/10/2013 Topics The Hadoop/MapReduce model Hadoop and High Energy Physics
More informationThe Elusive U,lity Customer: How Big Data & Analy,cs Connects U,li,es & Their Customers
The Place Analy,cs Leaders Turn to for Answers Member.U(lityAnaly(cs.com The Elusive U,lity Customer: How Big & Analy,cs Connects U,li,es & Their Customers Mike Smith Vice President, U(lity Analy(cs Ins(tute
More informationUnterstützung datenintensiver Forschung am KIT Aktivitäten, Dienste und Erfahrungen
Unterstützung datenintensiver Forschung am KIT Aktivitäten, Dienste und Erfahrungen Achim Streit Steinbuch Centre for Computing (SCC) KIT Universität des Landes Baden-Württemberg und nationales Forschungszentrum
More informationWorkflow Requirements (Dec. 12, 2006)
1 Functional Requirements Workflow Requirements (Dec. 12, 2006) 1.1 Designing Workflow Templates The workflow design system should provide means for designing (modeling) workflow templates in graphical
More informationHow To Find The Higgs Boson
Dezső Horváth: Search for Higgs bosons Balaton Summer School, Balatongyörök, 07.07.2009 p. 1/25 Search for Higgs bosons Balaton Summer School, Balatongyörök, 07.07.2009 Dezső Horváth MTA KFKI Research
More informationReal Time Tracking with ATLAS Silicon Detectors and its Applications to Beauty Hadron Physics
Real Time Tracking with ATLAS Silicon Detectors and its Applications to Beauty Hadron Physics Carlo Schiavi Dottorato in Fisica - XVII Ciclo Outline The ATLAS Experiment The SiTrack Algorithm Application
More informationSummer Student Project Report
Summer Student Project Report Dimitris Kalimeris National and Kapodistrian University of Athens June September 2014 Abstract This report will outline two projects that were done as part of a three months
More informationPLANNING FOR PREDICTABLE NETWORK PERFORMANCE IN THE ATLAS TDAQ
PLANNING FOR PREDICTABLE NETWORK PERFORMANCE IN THE ATLAS TDAQ C. Meirosu *#, B. Martin *, A. Topurov *, A. Al-Shabibi Abstract The Trigger and Data Acquisition System of the ATLAS experiment is currently
More informationSecure Hybrid Cloud Infrastructure for Scien5fic Applica5ons
Secure Hybrid Cloud Infrastructure for Scien5fic Applica5ons Project Members: Paula Eerola Miika Komu MaA Kortelainen Tomas Lindén Lirim Osmani Sasu Tarkoma Salman Toor (Presenter) salman.toor@helsinki.fi
More informationMeasurement of BeStMan Scalability
Measurement of BeStMan Scalability Haifeng Pi, Igor Sfiligoi, Frank Wuerthwein, Abhishek Rana University of California San Diego Tanya Levshina Fermi National Accelerator Laboratory Alexander Sim, Junmin
More informationATLAS grid computing model and usage
ATLAS grid computing model and usage RO-LCG workshop Magurele, 29th of November 2011 Sabine Crépé-Renaudin for the ATLAS FR Squad team ATLAS news Computing model : new needs, new possibilities : Adding
More informationJet Reconstruction in CMS using Charged Tracks only
Jet Reconstruction in CMS using Charged Tracks only Andreas Hinzmann for the CMS Collaboration JET2010 12 Aug 2010 Jet Reconstruction in CMS Calorimeter Jets clustered from calorimeter towers independent
More informationWeb based monitoring in the CMS experiment at CERN
FERMILAB-CONF-11-765-CMS-PPD International Conference on Computing in High Energy and Nuclear Physics (CHEP 2010) IOP Publishing Web based monitoring in the CMS experiment at CERN William Badgett 1, Irakli
More informationThe CMS Tier0 goes Cloud and Grid for LHC Run 2. Dirk Hufnagel (FNAL) for CMS Computing
The CMS Tier0 goes Cloud and Grid for LHC Run 2 Dirk Hufnagel (FNAL) for CMS Computing CHEP, 13.04.2015 Overview Changes for the Tier0 between Run 1 and Run 2 CERN Agile Infrastructure (in GlideInWMS)
More informationScalable stochastic tracing of distributed data management events
Scalable stochastic tracing of distributed data management events Mario Lassnig mario.lassnig@cern.ch ATLAS Data Processing CERN Physics Department Distributed and Parallel Systems University of Innsbruck
More informationBig Science and Big Data Dirk Duellmann, CERN Apache Big Data Europe 28 Sep 2015, Budapest, Hungary
Big Science and Big Data Dirk Duellmann, CERN Apache Big Data Europe 28 Sep 2015, Budapest, Hungary 16/02/2015 Real-Time Analytics: Making better and faster business decisions 8 The ATLAS experiment
More information