Big Data Needs High Energy Physics, especially the LHC
Richard P. Mount, SLAC National Accelerator Laboratory
June 27, 2013




Why so much data?

Our universe seems to be governed by nondeterministic physics, so one measurement tells us very little. However carefully we set up an experiment, nondeterministic physics decides what we observe. If we want to observe something rare, we may have to find a few occurrences (events) hidden in vast numbers of other events. The LHC experiments see bunch collisions (tens or hundreds of superimposed events) every 25 ns. That is 40 million per second, or a few hundred trillion bunch collisions in a typical 2,000-hour running year.
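A quick back-of-the-envelope check of that rate (a minimal sketch; the 2,000-hour running year is taken from the resource figures later in this talk):

    # Bunch-crossing arithmetic: 25 ns spacing -> crossing rate -> crossings per year.
    BUNCH_SPACING_S = 25e-9      # 25 ns between bunch crossings
    HOURS_PER_YEAR = 2000        # typical LHC running time per year (see later slide)

    rate_hz = 1.0 / BUNCH_SPACING_S                       # 40 million crossings/second
    crossings_per_year = rate_hz * HOURS_PER_YEAR * 3600

    print(f"crossing rate: {rate_hz:.0e} Hz")             # ~4e+07 Hz
    print(f"crossings per year: {crossings_per_year:.1e}")  # ~2.9e+14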

Example: Finding the Higgs (CMS and ATLAS at the LHC)

[Event display: a CMS Higgs candidate.]

[Event displays: ATLAS Higgs candidates.]

How Much Data?

The raw analog data rate from an LHC detector is about one petabyte per second. Storing all of it would cost on the order of one trillion dollars per year for storage alone. OK, wrong answer; what's the right question?
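That trillion-dollar figure is easy to sanity-check (a sketch; the ~$30/TB raw-disk price is my assumption for circa 2013, not a number from the talk):

    # Cost of storing 1 PB/s for a full year at an assumed raw-disk price.
    PB_PER_SECOND = 1.0
    SECONDS_PER_YEAR = 3.15e7
    USD_PER_TB = 30.0            # assumed ~2013 raw disk price (illustrative)

    tb_per_year = PB_PER_SECOND * SECONDS_PER_YEAR * 1000   # 1 PB = 1000 TB
    cost_usd = tb_per_year * USD_PER_TB

    print(f"data per year: {tb_per_year:.1e} TB")   # ~3e+10 TB (~30 zettabytes)
    print(f"storage cost: ${cost_usd:.1e}/year")    # ~1e+12: about a trillion dollars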

Philosophical Aside

"What are your data handling and computing requirements?" WRONG QUESTION.
"What could the data handling and computing technology that you can afford allow you to do?" RIGHT QUESTION.

So What Should We Do With One Petabyte/Second?

In real time, throw away the data not needed to make discoveries. But the discoveries with the greatest impact are those we don't expect. But we are smart, no? We really do throw away 99.9999% of LHC data before writing it to persistent storage.
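Rejecting 99.9999% means keeping about one part in a million. A minimal sketch of what that does to the stored data rate (order-of-magnitude only, using the round numbers on this slide; the recorded ATLAS rate quoted later is ~300 MB/s):

    # Effect of the real-time trigger: keep ~1 in 10^6 of a 1 PB/s raw stream.
    RAW_PB_PER_S = 1.0
    RETENTION = 1e-6             # 99.9999% thrown away
    HOURS_PER_YEAR = 2000

    stored_gb_per_s = RAW_PB_PER_S * RETENTION * 1e6     # 1 PB = 10^6 GB
    stored_pb_per_year = RAW_PB_PER_S * RETENTION * HOURS_PER_YEAR * 3600

    print(f"stored rate: {stored_gb_per_s:.1f} GB/s")       # ~1 GB/s
    print(f"stored per year: {stored_pb_per_year:.1f} PB")  # ~7 PB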

LHC Data Acquisition (DAQ): CMS Example

[Diagram: detector data flows through the filter systems, a.k.a. the High Level Trigger, and on to the WLCG.]

ATLAS High-Level Trigger (Part)

[Diagram: a total of 15,000 cores in 1,500 machines feed a disk buffer, which feeds the WLCG.]

Planning for the LHC, circa 2005

[Chart: LHC computing plans as they stood around 2005.]

Simulation in HEP

The fundamental physics processes are only measurable after non-invertible transformations:
Physics (e.g. quarks → observable particles);
Detector (noise, limited resolution, limited granularity);
Software (imperfect pattern recognition, confusion, bugs).
So we simulate billions of events, apply these transformations, and compare with the observed data. Simulating the needle is much easier than simulating the haystack. We need much more simulation than data to make the uncertainty contribution from simulation statistics negligible.
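To make the idea concrete, here is a toy version of the simulate-transform-compare workflow (entirely illustrative and requiring NumPy; the resolution, peak position, and distributions are invented, not ATLAS/CMS values):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def smear(true_energy_gev, resolution=0.1):
        """Toy detector transformation: Gaussian smearing, not invertible event by event."""
        return true_energy_gev * (1 + resolution * rng.standard_normal(true_energy_gev.shape))

    # "Truth": a hypothetical narrow signal peak on a steeply falling background.
    signal = rng.normal(125.0, 1.0, size=10_000)        # resonance near 125 GeV
    background = rng.exponential(50.0, size=1_000_000)  # smooth background

    # Apply the detector transformation to both, then histogram like real data.
    observed = smear(np.concatenate([signal, background]))
    counts, edges = np.histogram(observed, bins=100, range=(100.0, 150.0))

    # An analysis compares this simulated spectrum with the measured one; simulation
    # statistics must far exceed data statistics to add negligible uncertainty.
    print(counts.max(), edges[counts.argmax()])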

And Then What Happens?

Real and simulated data flow into data-intensive processing and physics analysis at roughly 100 computer centers and roughly 1,000 universities worldwide.

100 Computer Centers: Are You Crazy?

There is a political/funding/greed requirement for a distributed system: a centralized system might divide unit costs by a factor of 2, but it might divide funding by a factor of 4 at least. There are also some real benefits of the distributed system: it involves scientists worldwide, distributes expertise, and eases access to major opportunistic resources.

LHC Reality, 2012 (Mainly ATLAS and CMS)

Sites            Number   CPU Cores   Disk (PB)
Tier 0 (CERN)         1      27,000          22
Tier 1               11      82,000          79
Tier 2               66     215,000          94
TOTAL                78     324,000         196

Many Tier 2s are implemented as multi-site centers. Conundrum: ATLAS has been taking data at ~300 MB/s, running for typically 2,000 hours/year since November 2009, for a maximum of about 3 PB/year of raw data. Yet by the summer of 2011, all the ATLAS disks were full.
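The raw-data figure follows directly from the rate and the running time (a quick check):

    # ATLAS raw-data volume: ~300 MB/s for ~2,000 hours/year.
    MB_PER_S = 300
    HOURS_PER_YEAR = 2000

    pb_per_year = MB_PER_S * HOURS_PER_YEAR * 3600 / 1e9   # 1 PB = 10^9 MB

    print(f"raw data: {pb_per_year:.1f} PB/year")  # ~2.2 PB, under the ~3 PB maximum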

How Does Less Than 3 PB Become 100 PB?

1. Duplicate the raw data (not such a bad idea).
2. Add a similar volume of simulated data (essential).
3. Make a rich set of derived data products (some of them larger than the raw data).
4. Re-create the derived data products whenever the software has been significantly improved (several times a year), and keep the old versions for quite a while.
5. Place up to 40 copies of the derived data around the world, so that when you send the computing to the data you can send it almost anywhere.
6. Do the math (see the sketch below)!

Note: there are now far fewer copies, thanks to demand-driven temporary replication, but much more reliance on wide-area networks.
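Doing the math with illustrative multipliers (the per-step factors below are my assumptions chosen to mirror the slide's logic, not numbers from the talk):

    # From ~3 PB/year of raw data to ~100 PB on disk, step by step.
    raw_pb = 3.0

    duplicated_raw = raw_pb * 2                       # step 1: raw kept in two copies
    simulated = raw_pb                                # step 2: similar volume of simulation
    derived_per_version = 2 * (raw_pb + simulated)    # step 3: assumed rich derived products
    versions_kept = 2                                 # step 4: assume old versions retained
    replicas = 4                                      # step 5: assume several worldwide copies

    total_pb = duplicated_raw + simulated + derived_per_version * versions_kept * replicas
    print(f"total on disk: {total_pb:.0f} PB")        # ~105 PB with these assumed factors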

Final Aside: Data Rates and Technologies

I have been deeply involved in many data-intensive high energy physics experiments. Since the early 1980s, I have been personally involved in buying stuff (computers, storage, and networks) for these experiments. What picture emerges? (Some of the network purchases were made by Harvey Newman/Caltech.)

HEP Data versus Technologies

[Chart, log scale, 1980-2015: cost-performance trends for farm CPU boxes (kSI2000 per $M), RAID disk (GB per $M), transatlantic WAN bandwidth (kb/s per $M/yr), and disk accesses/s per $M, overlaid with the data rates (kb/s) of the EMC, BaBar, and ATLAS experiments.]

HEP Summary (Almost)

Hundreds of petabytes of data, with worldwide automated processing and data flow. But, fortunately, we are losing our leadership position in the data-intensive world.