Grid and data management (and saving yourself time)



From this document you will learn the answers to the following questions:

What percent of failure rate is resubmit?

What does j mean?

Who did you know how to manage your files on the grid?

Similar documents
BBC Learning English Talk about English Business Language To Go Part 1 - Interviews

Monitoring the team s performance

Cleaning Up Your Outlook Mailbox and Keeping It That Way ;-) Mailbox Cleanup. Quicklinks >>

User-friendly access to Grid and Cloud resources for 18scientific th 19 th computing January / 21

TeachingEnglish Lesson plans

Getting Started with WebSite Tonight

Distributed Computing for CEPC. YAN Tian On Behalf of Distributed Computing Group, CC, IHEP for 4 th CEPC Collaboration Meeting, Sep.

NPI Number Everything You Need to Know About NPI Numbers

Automated Offsite Backup with rdiff-backup

Werkzeuge zur verteilen Analyse im ATLAS-Experiment

5544 = = = Now we have to find a divisor of 693. We can try 3, and 693 = 3 231,and we keep dividing by 3 to get: 1

Supplemental Activity

Getting Started with Turbo Your PC

Opus: University of Bath Online Publication Store

Introduction to Programming and Computing for Scientists

Point and Interval Estimates

This manual will also describe how to get Photo Supreme SQLServer up and running with an existing instance of SQLServer.

Do you wish you could attract plenty of clients, so you never have to sell again?

Grid Engine 6. Troubleshooting. BioTeam Inc.

Top Ten Mistakes in the FCE Writing Paper (And How to Avoid Them) By Neil Harris

Tier 1 Services - CNAF to T1

CEFNS Web Hosting a Guide for CS212

WRITING PROOFS. Christopher Heil Georgia Institute of Technology

Using SVN to Manage Source RTL

Lesson 26: Reflection & Mirror Diagrams

ASVAB Study Guide. Peter Shawn White

LHCb Software Installation Tools. Stuart K. Paterson Ganga Workshop (Tuesday 14th June) 1

University of Hull Department of Computer Science. Wrestling with Python Week 01 Playing with Python

Getting Started With MySaleManager.NET

1 Description of The Simpletron

Configuring the Server(s)

Anger Management Course Workbook. 5. Challenging Angry Thoughts and Beliefs

Grid 101. Grid 101. Josh Hegie.

How To Proofread

What you should know about: Windows 7. What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling

RO-11-NIPNE, evolution, user support, site and software development. IFIN-HH, DFCTI, LHCb Romanian Team

CAs and Turing Machines. The Basis for Universal Computation

5 Group Policy Management Capabilities You re Missing

PhEDEx. Physics Experiment Data Export. Chia-Ming, Kuo National Central University, Taiwan ( a site admin/user rather than developer)

Life after Lotus Notes

WINDOWS AZURE EXECUTION MODELS

How to present to DESIGN SPECIFIERS. An article by:

How to Install SQL Server 2008

The phrases above are divided by their function. What is each section of language used to do?

Affiliate Marketing: How 30 minutes of work made me $100 overnight. By A. Geoff Barnett

Chapter 2. Making Shapes

EMPOWERING YOURSELF AS A COMMITTEE MEMBER

Writing Thesis Defense Papers

100 Ways To Improve Your Sales Success. Some Great Tips To Boost Your Sales

Version Control with. Ben Morgan

CSC 120: Computer Science for the Sciences (R section)

Centralized Disaster Recovery using RDS

Creating an e mail list in YahooGroups

Using the Push Notifications Extension Part 1: Certificates and Setup

Introduction to SQL for Data Scientists

Jan Brzeski s Introductory List-Building

CREATIVE S SKETCHBOOK

Fibonacci Numbers and Greatest Common Divisors. The Finonacci numbers are the numbers in the sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,...

How to Avoid the 10 BIGGEST MISTAKES. in Voice Application Development

When you are contacting your leads it s very important to remember a few key factors:

Getting Started With Facebook

MAIL MERGE TUTORIAL. (For Microsoft Word on PC)

The Doctor-Patient Relationship

The glite Workload Management System

Cloud: It s not a nebulous concept

Team Foundation Server 2012 Installation Guide

2.2 Derivative as a Function

Table of Contents. FleetSoft Installation Guide

Interviewer: Sonia Doshi (So) Interviewee: Sunder Kannan (Su) Interviewee Description: Junior at the University of Michigan, inactive Songza user

Remote Access for the NCSU Virtual Computing Lab Robert E. Young

Introducing Xcode Source Control

Documenting your research: logbooks, online reports, code archive

Measuring the Impact of Volunteering

TIPS TO HELP YOU PREPARE FOR A SUCCESSFUL INTERVIEW

NorduGrid ARC Tutorial

WebSphere Application Server: Configuration and Use of the Job Manager Component

HOW TO GET THE MOST OUT OF YOUR ACCOUNTANT

The Windows Command Prompt: Simpler and More Useful Than You Think

How to Bank and Save In Canada

Using Mail Merge in Microsoft Word 2003

PART I. The Mechanics of Mastering

Jazz Source Control Best Practices

The 5 P s in Problem Solving *prob lem: a source of perplexity, distress, or vexation. *solve: to find a solution, explanation, or answer for

Planning a Class Session

LHCb activities at PIC

QAD Enterprise Applications. Training Guide Demand Management 6.1 Technical Training

Resource Monitoring During Performance Testing. Experience Report by Johann du Plessis. Introduction. Planning for Monitoring

Image Credit:


Live Agent for Support Agents

May 25th, "Car Buying: How to Avoid the Extra Stress"--Mary Dittfurth

Multiplication Rules! Tips to help your child learn their times tables

Summer Student Report - Summer 2013 DBaaS with OpenStack Trove. Student: Andrea Giardini Supervisors: Giacomo Tenaglia - Ewan Roche

How to Outsource Without Being a Ninnyhammer

Assessing Speaking Performance Level B2

How to create even more authority and presence using the web. Your ultimate weapon to getting and sustaining change.

An objective comparison test of workload management systems

Managing your accounts

SQL Server Instance-Level Benchmarks with DVDStore

Transcription:

Grid and data management (and saving yourself time) Robert W. Lambert Rob Lambert, CERN LHCb Week, Dec 2011 1

Introduction Prerequisites Good knowledge of Python Grid certificate Have done the Ganga tutorials Have a LHCb Ganga job which makes some output files Learning outcomes Achieve 100% success rate on the grid Reduce the mails to the distributed analysis list You will know how to manage your files on the grid Rob Lambert, CERN LHCb Week, Dec 2011 2

Outline The Grid When to use the grid Expectations Getting the most out of the grid Data management Where are my data? What can I do with my data? Cleaning up after yourself Rob Lambert, CERN LHCb Week, Dec 2011 3

Common misconceptions The grid is slow Oh it really isn t The grid never works It works just fine 80% of the time It s hard to find the data The BK is very easy (Tutorial) The grid is only for experts That s why we have Ganga (Tutorial) Ganga can t do Yes, but it can do a zillion easier things Why do you want to do *that* anyway? Rob Lambert, CERN LHCb Week, Dec 2011 4

Common (Bad) Model Are you guilty of: Options from someone else or never tested Script from someone else or the same as last time Submit straight to Grid Failure mailing list or Give up! Go to batch. Computer says no. Rob Lambert, CERN LHCb Week, Dec 2011 5

Better Model Test and understand your code! Stick with the Ganga tutorials Adapt a similar tutorial Think, Ask, Test,Test, Test again Submit Local jobs, Batch jobs, Grid jobs Success! Tidy up, Next! Rob Lambert, CERN LHCb Week, Dec 2011 6

The Ganga Mantra Configure once, run anywhere! Rob Lambert, CERN LHCb Week, Dec 2011 7

The Ganga Mantra Configure once, run anywhere! There s a reason Ganga is programmed this way: Test locally and with small sets of jobs before submitting 10k! This will solve 99% of your Grid problems Rob Lambert, CERN LHCb Week, Dec 2011 8

Success or Failure? Failure means: Something went physically wrong with your job on the grid Maybe, though, it was after the important step Completed means: The job has stopped It doesn t mean it reached the end, or that it did anything at all Many segfaulting jobs still complete Success is defined by you: That ntuple was produced and is readable That line of stdout was printed All input files were seen and processed The integrated lumi calculated is >0 Rob Lambert, CERN LHCb Week, Dec 2011 9

Failures Success rate You should expect a finite success rate <100% With adequate testing this is usually 80% on the first pass Most emails to the list are one out of my 100 jobs fails Wow! That s 99% success!! Congratulations!!! ZOMG Remaining failures Glitch at the site Ran out of CPU Data were not available at that exact moment If 0% succeed, you ve done something wrong If 0% succeed at a given site, something else is wrong Rob Lambert, CERN LHCb Week, Dec 2011 10

Reaching 100% Follow a simple procedure 80% Check one of the failed jobs Weird failure Stopped in the middle of the job Skipped a lot of files Some error on reading a file Try just resubmitting it 96% Increase CPU and resubmit 99.2% Resubmit with the previous site banned 99.8% Gather up failures into a new job, resplit, resubmit 99.97% In [1]: job( 7.99 ).resubmit() In [2]: job( 7.99 ).backend.settings={ CPUTime : *2} In [3]: job( 7.99 ).backend.settings={ BannedSites :[ LCG.CERN.CH ]} In [4]: j=jobs(7).copy() In [5]: j.inputdata=jobs( 7.99 ).inputdata In [6]: j.splitter.filesperjob=1+len(j.inputdata.files)/10 In [7]: j.submit() Rob Lambert, CERN LHCb Week, Dec 2011 11

CPUTime Typical queues are in wall-clock time: 1 hour, 8 hours, 24 hours, 48 hours If you use part of a slot, don t worry, DIRAC will fill it up again CPU on the grid is in HEPSPEC06 units Performance of the machine * time spent Roughly ½ the speed of an lxplus node CPU depends on the input One subjob can take 50% of the total CPU CPU reported depends on the machine Not perfectly normalized, sometimes multiplied by #cores! Rule of thumb: (4-8) times the estimate from your local job Rob Lambert, CERN LHCb Week, Dec 2011 12

Part II Yeah! 100% of my jobs succeeded where are my data? Rob Lambert, CERN LHCb Week, Dec 2011 13

Input There are two types of input 1 2 Input Sandbox (j.inputsandbox) Input Data (j.inputdata) Small files. Need to be copied from your local machine, with the job, to the input directory of the job. e.g. options, scripts, code Large files. Not copied from your local machine. May exist only where the job is supposed to run. e.g. data from the experiment! Rob Lambert, CERN LHCb Week, Dec 2011 14

Output There are two types of output 1 2 Ouput Sandbox (j.outputsandbox) Ouput Data (j.outputdata) Small files. Need to be copied once the job has completed from the host machine, to your local machine. e.g. stdout, stderr, Histograms Large files. Not copied to your local machine. Uploaded to the GRID SE or to CERN CASTOR e.g. Massive ntuples, DSTs to share Rob Lambert, CERN LHCb Week, Dec 2011 15

Datasets inputdata, and outputdata, are Datasets in Ganga A dataset is a collection of data files, of one of two types: 1 2? Physical Files You know where the file is, and how to access it. It is located on a disk, this is the actual file you want your job to run over. pf = PhysicalFile( /disk/some/pfn.file ) Logical Files The file may be anywhere on the GRID. The access may be through some obscure protocol, and might be authenticated. The Logical File Name (LFN) is just the name of the file on the GRID, not where it is! lf = LogicalFile( /lhcb/some/lfn.file ) Rob Lambert, CERN LHCb Week, Dec 2011 16

Let s move some data Ganga makes this easy In [1]: j=jobs(7).copy() In [2]: j.outputsandbox Out[1]: [ DVHistos.root, DVnTuples.root ] In [3]: j.outputsandbox= [ DVHistos.root ] In [4]: j.outputdata=[ DVnTuples.root ] In [5]: j.outputdata.location= GridTutorial In [6]: j.submit() Copy data around In [1]: ds=j.backend.getoutputdatalfns() In [2]: ds.replicate( CERN-USER ) In [3]: ds[0].download( /tmp/ ) In [4]: afile=physicalfile( /tmp/dvntuples.root ) In [5]: dscp=afile.upload( /lhcb/user/<u>/<uname>/gridtutorial/dvntuples.root ) In [6]: j.backend.getoutputdata() Rob Lambert, CERN LHCb Week, Dec 2011 17

Where are my data? In ganga In [1]: j.application.outputdir In [2]: j.peek() In [3]: ds=j.backend.getoutputdatalfns() In [4]: reps=ds.getreplicas() In [5]: reps[ds[0].name] Out[9]: {'CERN-USER': 'srm://srmlhcb.cern.ch/castor/cern.ch/grid/lhcb/user/r/rlambert/2010_10/1 1645/11645816/2010.RedoStripping_B0q2DplusMuXTuned.dst', 'IN2P3-USER': 'srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/user/r/rlambert/2 010_10/11645/11645816/2010.RedoStripping_B0q2DplusMuXTuned.dst} How much space is that using? (out of ganga) $ SetupProject LHCbDirac $ dirac-dms-storage-usage-summary --Dir /lhcb/user/<u>/<username> DIRAC SE Size (TB) Files -------------------------------------------------- CERN-USER 1.3 2230 CERN-tape 0.0 215 Rob Lambert, CERN LHCb Week, Dec 2011 18

How much data can I keep? Depends where it is: Your afs home directory: almost nothing The disk on your computer: as much as you like The disk at your institute: as much as they let you Your CERN CASTOR_HOME directory: Well, lots, eventually it will be sent to tape Your GRID storage: Keep <2 TB, depending on activity So please don t: Merge central files (nor run on unmerged files) Run your own private MC generation Run your own private reconstruction/stripping Forget to clean up after yourself Rob Lambert, CERN LHCb Week, Dec 2011 19

Cleaning up In ganga: In [1]: ds=j.backend.getoutputdatalfns() In [2]: for d in ds...: d.remove() I don t want to keep the jobs, but want to keep the output In [1]: ds=j.backend.getoutputdatalfns() In [2]: box.add(ds, j.id+ +j.name+ Output LFNs ) In [3]: j.remove() I don t have the jobs any more (out of ganga) $ SetupProject LHCbDirac $ dirac-dms-user-lfns $ dirac-dms-remove-files <a-list-of-lfns> https://twiki.cern.ch/twiki/bin/view/lhcb/gridstoragequota Rob Lambert, CERN LHCb Week, Dec 2011 20

Running on Existing Data Your own DSTs Too large to store locally? Need to share with others? No need to download the DSTs You may want to replicate them to more than 1 site Run directly from the Grid with the LFNs Your own ntuples Hopefully small enough for InputSandbox No? Download, merge, upload or copy to Castor, clean up In [1]: help(j.backend.getoutputdata) In [2]: help(rootmerger) Access directly from castor (only a little slower than local) [1] Tfile* f=tfile::open( castor:/castor/... ) [2] TBrowser b Rob Lambert, CERN LHCb Week, Dec 2011 21

Hands-on 1. Ganga tutorials First exposure to Ganga 2. DaVinci tutorials Do something useful, make some output files 3. Ganga, grid and data-management (this talk) Submit some things to the grid Juggle output locations Manage your data Remove your data https://twiki.cern.ch/twiki/bin/view/lhcb/gridanddatamanagement Rob Lambert, CERN LHCb Week, Dec 2011 22

Final Tips Use the interactive prompt (as in the tutorials) Once you have one job which works, you re ready for anything Use the summary.xml: Like stdout only much much smaller (you can remove stdout most of the time) https://twiki.cern.ch/twiki/bin/view/lhcb/davincitutorial0 Ganga utilities A user-driven set of python functions for Ganga Really saves time, since we all do the same things (examples in this tutorial are mostly one line) https://twiki.cern.ch/twiki/bin/view/main/lhcbedinburghgroupgangautils Rob Lambert, CERN LHCb Week, Dec 2011 23

What you ve learnt 1. Ganga makes your life easier 2. Ganga is not a replacement for your brain 3. How to be effective on the GRID 4. How to manage your grid data Rob Lambert, CERN LHCb Week, Dec 2011 24

End Backups are often required Rob Lambert, CERN LHCb Week, Dec 2011 25