Open access to data and analysis tools from the CMS experiment at the LHC
Thomas McCauley (for the CMS Collaboration and QuarkNet)
University of Notre Dame, USA
thomas.mccauley@cern.ch
5 Feb 2015
Outline
CMS at the LHC
First public release of CMS data
CMS masterclasses
Large data release
Open data portal
Outlook and future plans
CMS at the LHC
CMS (Compact Muon Solenoid) is one of the two general-purpose experiments at the LHC
Over 350 papers published describing searches for SUSY and exotica; measurements of QCD, electroweak, top, b, forward, and heavy-ion physics; and the discovery of the Higgs boson and measurements of its properties
Collected ~28 1/fb of proton-proton collision data at center-of-mass energies up to 8 TeV
Nearly 3000 physicists and ~800 engineers from over 40 countries
http://cern.ch/cms
CMS public data (i)
The CMS experiment has allowed the release of the following data to the public for use in education and outreach:
2000 events each of J/ψ → μμ, J/ψ → ee
2000 events each of Υ → μμ, Υ → ee
500 events each of Z → μμ, Z → ee
1000 events each of W → μν, W → eν
100,000 events each of di-muon, di-electron, and di-jet events in the energy range 2-110 GeV
19 Higgs candidate events: 10 γγ, 1 2e2μ, 1 4e, 1 4μ, 2 bb, 2 ττ, 2 WW in the mass range 120-130 GeV
~50 1/pb of single-muon data for top quark analysis
Bold indicates datasets already delivered and/or in use
These data form the core of the masterclasses
CMS public data (ii)
Masterclasses
Masterclasses: students travel to nearby universities and research laboratories to listen to lectures, analyze real LHC data, and interact with other groups via videoconference.
International masterclasses are organized under the auspices of IPPOG, the International Particle Physics Outreach Group (http://ippog.web.cern.ch), with central organization at TU Dresden and Notre Dame.
In 2014 (from Feb 12 to Apr 12) there were 69 CMS masterclasses in 26 countries in 12 languages.
The CMS masterclass is developed in collaboration with QuarkNet (http://quarknet.fnal.gov)
Current CMS exercise: W+:W- ratio; Z, J/ψ, and Υ invariant mass
CMS masterclasses in 2014
https://quarknet.i2u2.org/content/running-cms-wzh-path-masterclass
http://cms.physicsmasterclasses.org/cms.html
CMS masterclasses
2014 CMS masterclass exercise
Students use up to 30 separate datasets, each with 100 events sampled from W, Z, and di-lepton events (one 4-lepton and two di-photon Higgs candidate events included)
Each group views up to 100 events in an event display and attempts to determine whether each is a W or a Z (di-lepton) event.
If a W, did it decay into an electron and a neutrino or into a muon and a neutrino? What is the charge of the lepton?
If a Z, is it di-electron or di-muon? What is the invariant mass?
What is the W+:W- ratio? What does it mean for the proton and its structure?
What does the invariant mass spectrum look like? (There will be several unexpected peaks from the di-lepton background)
2015: the content is the same; the data analysis tools are improved (covered later). What follows shows the 2014 exercise.
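The invariant mass question in the exercise can be illustrated with a short sketch. It assumes each lepton is given as a four-vector (E, px, py, pz) in GeV and uses m² = E² − |p|² for the summed four-vector. The function name and the sample four-vectors are made up for illustration; they are not taken from the masterclass tools or data.

```python
import math

def invariant_mass(p1, p2):
    """Invariant mass (GeV) of a two-particle system.

    p1, p2: four-vectors as (E, px, py, pz) tuples in GeV.
    """
    e = p1[0] + p2[0]
    px = p1[1] + p2[1]
    py = p1[2] + p2[2]
    pz = p1[3] + p2[3]
    m2 = e * e - (px * px + py * py + pz * pz)
    # Floating-point rounding can push m2 slightly below zero
    # for nearly collinear, massless inputs; clamp before the root.
    return math.sqrt(max(m2, 0.0))

# Made-up example: two back-to-back massless leptons of 45 GeV each.
lepton1 = (45.0, 45.0, 0.0, 0.0)
lepton2 = (45.0, -45.0, 0.0, 0.0)
print(invariant_mass(lepton1, lepton2))  # 90.0
```

Filling a histogram with this quantity for many di-lepton events is what produces the invariant mass spectrum mentioned above.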
After an introduction by the moderator covering HEP and the experiment, start by opening the event display
Browser-based event display written in JavaScript
Select a set of 100 W, Z, J/ψ, and Υ events (each with a Higgs candidate included)
Electron? Significant missing transverse energy (MET)? Therefore, it's a W → eν event? But is it an e+ or an e-?
The electron appears to curve clockwise, therefore e+
Mark the answer on the spreadsheet (hosted on Google Docs): mark as a W+ → e+ν candidate
Muon, muon. Therefore a Z → μ+μ- candidate?
In the 2014 masterclasses...
...students correctly identified an event as a Z candidate (i.e. an event with 2 leptons) 92% of the time
...students correctly identified an electron 90% of the time and a muon 93% of the time
...students correctly identified an event as a W 91% of the time
...when the students correctly identified an event as W → μν (W → eν), they correctly identified the charge 84% (81%) of the time. 11% (16%) of these events were assigned no charge
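Tallies like those the students record can be turned into a W+:W- ratio with a few lines of code. The following is a minimal sketch: the label strings, the function name, and the sample counts are hypothetical, and the uncertainty assumes independent Poisson counts, which is a simplification and not necessarily how the masterclass results are computed.

```python
import math
from collections import Counter

def w_charge_ratio(labels):
    """Return (ratio, uncertainty) for N(W+) / N(W-).

    labels: iterable of strings such as "W+", "W-", "Z" (hypothetical format).
    """
    counts = Counter(labels)
    n_plus = counts["W+"]
    n_minus = counts["W-"]
    if n_plus == 0 or n_minus == 0:
        raise ValueError("need at least one W+ and one W- candidate")
    ratio = n_plus / n_minus
    # Error propagation for a ratio of two independent Poisson counts.
    uncertainty = ratio * math.sqrt(1.0 / n_plus + 1.0 / n_minus)
    return ratio, uncertainty

# Made-up tallies for illustration only.
labels = ["W+"] * 30 + ["W-"] * 20 + ["Z"] * 10
ratio, uncertainty = w_charge_ratio(labels)
print(ratio)  # 1.5
```

With only 100 events per group the statistical uncertainty is large, which is one reason for combining results across groups.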
2014 results
CMS value: http://cds.cern.ch/record/1646590
2014 student results
Videoconference
Students communicate and discuss results with other masterclass groups using Vidyo (http://cern.ch/vidyo), with support from CERN and FNAL IT
A recorded videoconference: http://cds.cern.ch/record/1693152
For 2015
Exercise to remain the same
New IPPOG masterclasses start next month
Masterclasses for CERN visitors start next week
New browser-based tool developed by RWTH Aachen will replace Google spreadsheets and include on-the-fly creation of plots
New event display
Beyond 2015: new opportunity to use open data from CMS to develop new exercises
http://cern.ch/cms-masterclass/ispy-webgl
https://www.i2u2.org/elab/cms/cima/index.php Web-based data entry and histogram tool developed by RWTH Aachen
CMS Open Data policy
CMS has drafted and adopted a data preservation, re-use, and open-access policy which includes:
Commitment to publication in open-access journals
Release of data to the public
Preservation and release of software and documentation needed for reconstruction and analysis
In the future: a commitment to release data after a suitable embargo period
https://cms-docdb.cern.ch/cgi-bin/publicdocdb/showdocument?docid=6032
New release (i)
The new release of CMS data is much larger and more extensive than previous releases:
Half of the reconstructed data from 2010 proton-proton collisions at 7 TeV (tens of 1/pb)
~30 TB in size
In CMS Analysis Object Data (AOD) format (ROOT files)
New release (ii)
CMS AOD
Contains information needed for an analysis, such as physics objects, tracks, calorimeter hits, vertices, trigger info, etc.
ROOT-based format that needs CMSSW in order to be read and analyzed
Q: How can/will the public handle such a dataset?
A (partially): Initially focus on an already-proven, successful use case: education and outreach
How does one get from...
...to
...or to
Open Data Portal
Data, tools, and resources for analysis have been made available via an open data portal
The portal is divided into two main areas: Education and Research
Datasets are distinguished as either primary or derived
Philosophy: include and build upon the previous and current success of public data in education and outreach, but also allow for more in-depth, complex analysis
Built with Invenio digital library software: http://invenio-software.org
The portal is a collaboration between CERN, CMS, ATLAS, ALICE, and LHCb; what follows is a description of the CMS content
http://opendata.cern.ch
http://press.web.cern.ch/press-releases/2014/11/cern-makes-public-first-data-lhc-experiments
Education
Derived dataset record
A derived dataset is a dataset that has been created from a primary dataset and contains reduced information (like four-vectors)
Software with which to create the derived datasets is provided
Analysis of derived datasets does not require special CMS software (but production of derived datasets might)
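Because a derived dataset holds reduced information such as four-vectors, it can in principle be analyzed with nothing more than a standard library. The sketch below assumes a CSV layout with columns E1, px1, py1, pz1, E2, px2, py2, pz2 (in GeV); these column names and the two sample rows are hypothetical and do not describe the actual files on the portal.

```python
import csv
import io
import math

def dilepton_masses(csv_file):
    """Read two-lepton four-vectors from a CSV file object and
    return the invariant mass (GeV) of each event."""
    masses = []
    for row in csv.DictReader(csv_file):
        e = float(row["E1"]) + float(row["E2"])
        px = float(row["px1"]) + float(row["px2"])
        py = float(row["py1"]) + float(row["py2"])
        pz = float(row["pz1"]) + float(row["pz2"])
        m2 = e * e - (px * px + py * py + pz * pz)
        masses.append(math.sqrt(max(m2, 0.0)))
    return masses

def histogram(values, low, high, nbins):
    """Count values in nbins equal-width bins covering [low, high)."""
    width = (high - low) / nbins
    counts = [0] * nbins
    for v in values:
        if low <= v < high:
            # min() guards against rounding at the upper edge.
            counts[min(int((v - low) / width), nbins - 1)] += 1
    return counts

# Made-up rows standing in for a derived dataset file.
sample = io.StringIO(
    "E1,px1,py1,pz1,E2,px2,py2,pz2\n"
    "45.0,45.0,0.0,0.0,45.0,-45.0,0.0,0.0\n"
    "5.0,3.0,4.0,0.0,5.0,-3.0,-4.0,0.0\n"
)
masses = dilepton_masses(sample)
counts = histogram(masses, 0.0, 120.0, 12)
print(masses)
print(counts)
```

A real file would be opened with `open(path, newline="")` in place of the in-memory `io.StringIO` object used here.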
Education: histogram tool
Education: event display
Education: example analysis
Research
CMS-specific CERN VM
Analysis of primary datasets requires the CMSSW environment; we provide it in a virtual machine image
The VM contains SLC5 (Scientific Linux CERN 5), the CMS software environment, and access to primary datasets via XRootD
Example code is also available via GitHub
Primary dataset record (i)
Primary dataset record (ii)
CMS External Resources
Invenio and CERN support
The open data portal is built with Invenio (a familiar example of an application using Invenio is the CERN Document Server, http://cdsweb.cern.ch)
Invenio provides document organization, search capability, and handling of metadata
The portal relies on CERN support and services for data storage, access to and distribution of data, and security and bandwidth restrictions
Data re-use
Data are released under the Creative Commons CC0 waiver, essentially releasing them into the public domain: http://creativecommons.org/publicdomain/zero/1.0
Data are identified with digital object identifiers (DOIs), and it is expected that third parties will access the data using these
Outlook
CMS public data has reached thousands of students all over the world via CMS masterclasses
Re: open data portal
"We can conclude that about ~82k distinct users visited our site since the launch, out of which ~600 people downloaded EOS files over HTTP, ~5k read About pages, ~21k viewed collections, ~16k used event display, ~3k used histogramming, ~21k viewed records, and ~10k used search." - T. Simko (Invenio team), 19 Dec 2014
Next: improve tools and, with the new large data release, develop new education and outreach (E&O) programs
Thank you