Solving the Mysteries of the Universe with Big Data

Solving the Mysteries of the Universe with Big Data
Sverre Jarp, former CERN openlab CTO
Big Data Innovation Summit, Stockholm, 8th May 2014
Accelerating Science and Innovation

What is CERN?
The European Particle Physics Laboratory, based in Geneva, Switzerland. Current accelerator: the Large Hadron Collider (LHC). Founded in 1954 by 12 countries for fundamental physics research in post-war Europe. Today it is a global effort of 21 member countries and scientists of 110 nationalities, working on the world's most ambitious physics experiments: ~2,300 personnel, >11,000 users, ~900 MCHF budget.

CERN openlab
A unique research partnership between CERN and the ICT industry. Objective: the advancement of cutting-edge computing solutions to be used by the worldwide Large Hadron Collider (LHC) community.

WHY do we need a CERN?

Fundamental Physics Questions
What is 95% of the Universe made of? We only observe a fraction; what is the rest?
Why is there no antimatter left in the Universe? Nature should be symmetrical, should it not?
What was matter like during the first second of the Universe, right after the "Big Bang"? A journey towards the beginning of the Universe gives us deeper insight.
Why do particles have mass? Newton could not explain it; the Higgs mechanism now seems to be the answer.
CERN has built the LHC to enable us to look at microscopic big bangs and so understand the fundamental behaviour of nature.

So, how do you get from this... to this... and this?

Some facts about the LHC
Biggest accelerator (largest machine) in the world: 27 km circumference, 9,300 magnets.
Fastest racetrack on Earth: protons circulate 11,245 times/s (at 99.9999991% of the speed of light).
Emptiest place in the solar system: high vacuum inside the magnets, at a pressure of 10^-13 atm (10x less than the pressure on the Moon).
World's largest refrigerator: -271.3 °C (1.9 K).
Hottest spot in the galaxy: lead-ion collisions create temperatures 100,000x hotter than the heart of the Sun; the new record is 5.5 trillion K.
World's biggest and most sophisticated detectors: 150 million pixels.
Most data of any scientific experiment: 15-30 PB per year.
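
As a quick cross-check of those numbers (using the commonly quoted ring circumference of 26,659 m, slightly under the rounded 27 km), circumference times revolution frequency indeed comes out at essentially the speed of light:

\[
v \approx C \cdot f = 26\,659\ \mathrm{m} \times 11\,245\ \mathrm{s^{-1}} \approx 2.998 \times 10^{8}\ \mathrm{m/s} \approx c
\]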

Detectors as big as 5-storey buildings: 150 million sensors deliver data 40 million times per second.

Tier 0 at CERN: acquisition, first-pass reconstruction, storage & distribution. Data rate: 1.25 GB/sec (ions).

What is this data?
Raw data: Was a detector element hit? How much energy? At what time?
Reconstructed data: particle type, origin, momentum of tracks (4-vectors), energy in clusters (jets), calibration information.
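
To make the raw/reconstructed distinction concrete, here is a minimal C++ sketch; the type and field names are invented for illustration and do not come from any experiment's actual event model:

```cpp
#include <vector>

// Raw data: one record per detector element that fired.
struct RawHit {
    int   channelId;  // which detector element was hit
    float energy;     // how much energy was deposited
    float time;       // at what time the hit occurred
};

// Reconstructed data: physics objects derived from many raw hits.
struct RecoParticle {
    int    pdgId;            // particle type (PDG numbering)
    double vx, vy, vz;       // origin: reconstructed production vertex
    double px, py, pz, e;    // 4-vector: track momentum and energy
};

// One event carries both views of the same collision.
struct Event {
    std::vector<RawHit>       rawHits;
    std::vector<RecoParticle> particles;
};
```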

Data Flow and Computation for Physics Analysis
Online triggers and filters: 100% (raw data, 6 GB/s).
Offline reconstruction: selection & reconstruction, event reprocessing, event simulation; output is 10% (event summary).
Offline analysis with ROOT: interactive analysis and batch physics analysis on processed data (active tapes): 1%.
Almost all of the application software is written in-house (i.e. by the physicists themselves).
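
The staged reduction (100% -> 10% -> 1%) can be pictured as a chain of filters, each keeping only a fraction of what the previous stage accepted. A minimal sketch, with purely illustrative stage logic rather than any experiment's real trigger code:

```cpp
#include <functional>
#include <utility>
#include <vector>

struct Event { double maxEnergy = 0.0; };   // stand-in event record

// A trigger stage is simply a predicate: keep the event or drop it.
using TriggerStage = std::function<bool(const Event&)>;

// Pass events through successive stages; only survivors move on.
std::vector<Event> runChain(std::vector<Event> events,
                            const std::vector<TriggerStage>& stages) {
    for (const auto& stage : stages) {
        std::vector<Event> kept;
        for (const auto& ev : events)
            if (stage(ev)) kept.push_back(ev);   // selection cut
        events = std::move(kept);                // e.g. 100% -> 10% -> 1%
    }
    return events;
}
```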

The LHC Computing Challenge
Signal/noise: 10^-13 (10^-9 offline).
Data volume: high rate x large number of channels x 4 experiments = 15 petabytes of new data each year (30 PB in 2012).
Overall compute power: event complexity x number of events x thousands of users = 200k cores and 45 PB of disk storage (today: 350k cores and 150 PB).
Worldwide analysis & funding: computing is funded locally in major regions and countries, and GRID technology makes efficient analysis possible everywhere.

So, what are our issues with Big Data? Credit: A. Pace

July 4th 2012: The Status of the Higgs Search, J. Incandela for the CMS Collaboration. Shown: an H→γγ candidate event and the evolution of the excess with time. (Credit: Ian.Bird@cern.ch, August 2012)

Worldwide LHC Computing Grid
Tier 0, Tier 1, Tier 2. Today: >140 sites, >350k x86 PC cores, >150 PB disk.

Processing on the grid
Monthly job and CPU-usage charts for ALICE, ATLAS, CMS and LHCb (Jan 2009 - Jan 2012) show that usage continues to grow, in both jobs/day and CPU time. We run about 1.5M jobs/day and deliver roughly 10^9 HEPSPEC-hours per month (~150k CPUs in continuous use), i.e. about 150,000 years of CPU per year. This is close to full capacity; we always need more!
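
Those two figures are mutually consistent if one assumes a per-core benchmark score typical of that era's hardware, roughly 10 HEP-SPEC06 per core (my assumption, not a figure from the talk):

\[
150\,000\ \text{cores} \times 730\ \mathrm{\tfrac{hours}{month}} \times \sim\!10\ \mathrm{\tfrac{HEPSPEC06}{core}} \approx 10^{9}\ \text{HEPSPEC-hours/month}
\]

and 150,000 cores kept busy around the clock deliver, by definition, about 150,000 CPU-years per calendar year.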

Tier-0: Central Data Management
Hierarchical storage management: CASTOR. Rich set of features: tape pools, disk pools, service classes, instances, file classes, file replication, scheduled transfers, etc. DB-centric architecture.
Disk-only storage system: EOS. Easy-to-use, stand-alone, disk-only storage for user and group data, with an in-memory namespace. Low latency (a few ms for read/write open). Focused on end-user analysis with chaotic access. Adopts ideas from other modern file systems (Hadoop, Lustre, etc.). Runs on low-cost hardware (JBOD and software RAID).
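
EOS is normally accessed over the XRootD protocol, which ROOT supports natively. A minimal sketch of opening a remote file; the host and path below are made-up placeholders, not a real EOS instance:

```cpp
#include <TFile.h>
#include <cstdio>
#include <memory>

int main() {
    // TFile::Open dispatches on the URL scheme; "root://" selects XRootD.
    std::unique_ptr<TFile> f(
        TFile::Open("root://eos.example.cern.ch//eos/example/events.root"));
    if (!f || f->IsZombie()) {
        std::printf("could not open remote file\n");
        return 1;
    }
    f->ls();   // list the trees and histograms stored in the file
    return 0;
}
```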

Active tapes: inside a huge storage hierarchy, tapes may be advantageous! We use tape storage products from multiple vendors.

CASTOR current status (end 2012): 66 petabytes across 362 million files.

Big Data Analytics
A huge quantity of data is collected, but most events simply reflect well-known physics processes. New physics effects are expected in a tiny fraction of the total events: the needle in the haystack. It is crucial to have good discrimination between the interesting events and the rest, i.e. between different species. Complex data analysis techniques play a crucial role.

ROOT
Object-oriented data analysis toolkit. Written in C++ (millions of lines); open source; integrated interpreter; multiple file formats. Covers I/O handling, graphics, plotting, math, histogram binning, event display and geometric navigation, with powerful fitting (RooFit) and statistics (RooStats) packages on top. In use by all our collaborations.
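
For flavour, a minimal self-contained ROOT example in the fill-and-fit style the toolkit is built around (toy data and invented numbers, not any real analysis):

```cpp
#include <TFile.h>
#include <TH1F.h>
#include <TRandom3.h>

int main() {
    // Book a histogram: 100 bins over a toy "mass" range.
    TH1F h("h", "Toy mass peak;mass [GeV];events", 100, 80.0, 160.0);

    // Fill it with pseudo-data: a Gaussian bump around 125 GeV.
    TRandom3 rng(42);
    for (int i = 0; i < 10000; ++i)
        h.Fill(rng.Gaus(125.0, 5.0));

    h.Fit("gaus");                        // built-in Gaussian fit
    TFile out("example.root", "RECREATE");
    h.Write();                            // persist in ROOT's file format
    return 0;
}
```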

RooFit/RooStats
The standard tool for producing physics results at the LHC: parameter estimation (fitting), interval estimation (e.g. limit results for new-particle searches), and discovery significance (quantifying an excess of events). Implements several statistical methods (Bayesian, frequentist, asymptotic). New tools have been added for model creation and combinations.
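
A minimal RooFit sketch of the parameter-estimation step: build a Gaussian model, generate a toy dataset from it, and fit the parameters back by maximum likelihood. Variable names and numbers are illustrative only:

```cpp
#include <RooDataSet.h>
#include <RooGaussian.h>
#include <RooRealVar.h>
#include <memory>

int main() {
    // Observable and model parameters (initial value, min, max).
    RooRealVar mass("mass", "mass [GeV]", 80.0, 160.0);
    RooRealVar mean("mean", "peak position", 125.0, 100.0, 150.0);
    RooRealVar sigma("sigma", "peak width", 5.0, 0.1, 20.0);
    RooGaussian model("model", "Gaussian peak", mass, mean, sigma);

    // Toy dataset drawn from the model itself.
    std::unique_ptr<RooDataSet> data(model.generate(mass, 10000));

    // Maximum-likelihood fit: recovers mean and sigma with uncertainties.
    model.fitTo(*data);
    mean.Print();
    sigma.Print();
    return 0;
}
```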

ROOT files
The default format for all our data, organised as trees with branches. Sophisticated formatting for optimal analysis of the data: parallelism, prefetching and caching; compression, splitting and merging. Over 100 PB is stored in this format, all over the world.
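
A minimal sketch of the tree-with-branches layout; the branch names are invented for illustration:

```cpp
#include <TFile.h>
#include <TRandom3.h>
#include <TTree.h>

int main() {
    TFile f("events.root", "RECREATE");
    TTree tree("events", "toy event tree");

    // Each branch is one column; each entry (row) is one event.
    float pt = 0.0f;
    int   nHits = 0;
    tree.Branch("pt", &pt, "pt/F");           // transverse momentum
    tree.Branch("nHits", &nHits, "nHits/I");  // hit multiplicity

    TRandom3 rng(1);
    for (int i = 0; i < 1000; ++i) {
        pt = rng.Exp(20.0);                   // toy momentum spectrum
        nHits = 10 + rng.Integer(40);         // toy hit count
        tree.Fill();                          // append one entry
    }
    tree.Write();   // branches are compressed and stored independently
    return 0;
}
```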

Big Data at CERN: all of this needs to be covered!

Conclusions
Big Data management and analytics require a solid organisational structure at all levels. We must avoid Big Headaches: enormous file sizes and/or enormous file counts; data movement, placement, access patterns and life cycle; replicas, backup copies, etc. Big Data also implies big transactions and transaction rates. Corporate culture matters: our community started preparing more than a decade before real physics data arrived. Now the situation is well under control, but data rates will continue to increase dramatically for years to come: Big Data at the exabyte scale! There is no time to rest!

References and credits
http://www.cern.ch/
http://wlcg.web.cern.ch/
http://root.cern.ch/
http://eos.cern.ch/
http://castor.cern.ch/
http://panda.cern.ch/
http://www.atlas.ch/
I am indebted to several of my colleagues at CERN for this material, in particular: Ian Bird, WLCG Project Leader; Alberto Pace, Manager of the Data Services Group at CERN; and the members of his group.