Novel, Highly-Parallel Software for the Online Storage System of the ATLAS Experiment at CERN: Design and Performance
Tommaso Colombo (a,b), Wainer Vandelli (b)
(a) Università degli Studi di Pavia  (b) CERN
IEEE Real-Time Conference, 12 June 2012
T. Colombo, W. Vandelli (Pavia U., CERN) ATLAS Highly-parallel Data-Logging System Real-Time Conference, 12 Jun 2012 1 / 13
The ATLAS Trigger & DAQ System
[Diagram of the trigger/DAQ chain. Event rates, design (2012 peak): Level 1 Trigger (custom hardware, 2.5 µs latency): 40 MHz (20 MHz) in, 75 kHz (~65 kHz) Level 1 Accepts; Level 2 (~5000 Processing Units, ~40 ms (~45 ms) latency, Region-of-Interest data): 3 kHz (~5.5 kHz) L2 Accepts; Event Filter (~5000 Processing Units, ~4 s (~1 s) latency, full events): ~200 Hz (~800 Hz) EF Accepts. Data rates, design (2012 peak): ATLAS event size 1.5 MB/25 ns (1.6 MB/50 ns); Detector Readout (~150 Readout Systems): ~110 GB/s (~105 GB/s); Event Builder (~100 nodes) over the Data Collection Network: ~4.5 GB/s (~9 GB/s); 5 Data Loggers over the Event Filter Network to CERN Permanent Storage: ~300 MB/s (~1100 MB/s).]
The current Data Logging system: overview
Purpose
5 PCs receive data from the Event Filter system and write it to local disks. Each event is:
analyzed to determine the tags applied by the Event Filter trigger
processed (e.g. compressed)
written to the appropriate file(s) according to its tags
Details
The event tags are determined by the trigger algorithms based on the event content
To facilitate off-line data distribution, every event is written to multiple files, one per tag
The file checksum is calculated while writing: CPU-intensive!
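The tag-driven write path can be sketched as follows. This is an illustrative Python sketch, not the actual C++ Data Logger code; the event and tag structures are invented for the example, and a running Adler-32 stands in for whatever checksum the real raw-data format uses. The point is that the checksum is updated incrementally while writing, so no second pass over the file is needed.

```python
import zlib

class TagFileWriter:
    """Writes each event to one file per tag, updating a running
    checksum as data is appended (illustrative sketch only)."""

    def __init__(self):
        self.files = {}      # tag -> open file object
        self.checksums = {}  # tag -> running Adler-32 value

    def write_event(self, event_bytes, tags):
        for tag in tags:
            if tag not in self.files:
                self.files[tag] = open(f"run.{tag}.data", "wb")
                self.checksums[tag] = zlib.adler32(b"")
            # The checksum is updated incrementally while writing,
            # so the file never needs to be re-read.
            self.checksums[tag] = zlib.adler32(event_bytes, self.checksums[tag])
            self.files[tag].write(event_bytes)

    def close(self):
        for f in self.files.values():
            f.close()
        return self.checksums
```

An event carrying two tags is written twice, once per output file, each file keeping its own running checksum.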
The current Data Logging system: limitations
The current Data Logger implementation is essentially single-threaded:
multiple threads receive the events from the EF
a single thread does the processing and writing
[Diagram: network I/O threads put events on a queue; a single processing-and-writing thread gets them one by one.]
This design is very unlikely to scale:
maximum processing throughput: ~500 MB/s, comparable with the I/O (network and disk) limits
It is a major blocker for the addition of new features requiring more CPU power than a single core can provide:
because of this, event-level data compression currently has to be performed off-line, as an additional step
New design: general considerations
The data processing workload is embarrassingly parallel: the incoming data are already divided into events
Constraint
The raw data file format is strictly sequential:
it is impossible to do concurrent writes to the same file
the overall file checksum has to be calculated as the file is written to disk, aiding the detection of write errors
this keeps the format complexity to a minimum
Multiple events can be written to different raw data files concurrently, but no more than one event can be written to each data file at once
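The constraint can be stated directly as one lock per output file: writes to different files may proceed concurrently, while writes to the same file are serialized. A minimal Python sketch, with names invented for the example (the deck's backup slides show that a lock-based thread pool was considered but not chosen):

```python
import threading
from collections import defaultdict

class SerializedFiles:
    """One lock per raw data file: concurrent writes to different
    files are allowed, concurrent writes to the same file are not."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself
        self.contents = defaultdict(bytes)

    def write(self, filename, data):
        with self._guard:
            lock = self._locks[filename]
        with lock:  # at most one writer per file at any time
            self.contents[filename] += data
```

With this scheme a thread writing to file A never blocks a thread writing to file B, but two threads targeting the same file take turns.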
New design: idea
Split the workload into tasks
For each event:
one task does the processing
multiple tasks do the writing (for each tag, a task writes the event to the corresponding file)
Use a single thread pool to execute the tasks
Schedule the tasks cleverly to avoid locking
At any given time:
any number of processing tasks can run
for each raw data file, only one task writing to it can run
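The scheduling idea above can be sketched in Python: a shared thread pool runs all tasks, and a manager keeps one queue of pending writes per file. A writing task for a file is submitted only when no other writer for that file is in flight; on completion it schedules the next queued write. This is a sketch of the concept only, not the real TBB-based implementation, and all names are invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class RawFileManager:
    """Per-file queues of pending writes. At most one writing task
    per file is in flight; when it completes, the next queued write
    for that file (if any) is handed back to the thread pool."""

    def __init__(self, pool):
        self.pool = pool
        self.lock = threading.Lock()
        self.pending = {}   # filename -> list of queued payloads
        self.contents = {}  # filename -> bytes written so far

    def submit_write(self, filename, data):
        with self.lock:
            if filename in self.pending:
                # A writer for this file is already in flight: just queue.
                self.pending[filename].append(data)
                return
            self.pending[filename] = []
        self.pool.submit(self._write_task, filename, data)

    def _write_task(self, filename, data):
        # Safe without a lock: we are the only writer for this file.
        self.contents[filename] = self.contents.get(filename, b"") + data
        with self.lock:
            queued = self.pending[filename]
            if not queued:
                del self.pending[filename]
                return
            nxt = queued.pop(0)
        # Completion notification: schedule the next write for this file.
        self.pool.submit(self._write_task, filename, nxt)
```

No task ever blocks on a file lock: serialization per file is achieved purely by how tasks are scheduled, which is the "schedule only one / notify completion" pattern in the diagram that follows.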
New design: finally, a diagram!
[Diagram: incoming events feed processing tasks (PT nn: processing task for event nn); each processed event spawns one writing task per stream (WT nn-aa: writing task for stream aa of event nn). Tasks are enqueued on a shared execution queue and run by a thread pool. A raw file manager keeps one queue per file (e.g. eγ, μ, jets); for each file only one writing task is scheduled at a time, and its completion notification triggers the scheduling of the next.]
New design: implementation with Threading Building Blocks
The new design was implemented using (and inspired by) the open-source C++ library Intel Threading Building Blocks:
task-based multi-threading: the task execution queue is served by a thread pool
thread-safe containers optimized for concurrency: a concurrent queue per file and a concurrent hash map
[Diagram as on the previous slide, annotated with the corresponding TBB components.]
Performance evaluation: resource utilization
The new implementation was tested and compared with the old one:
in the current production system
in a testbed with older hardware
A single Data Logger machine was operated at saturation
The dataset consisted of actual event data, with 1 to 4 tags assigned to each event:
changing the number of tags per event changes the number of files the Data Logger has to write each event to
it does not change the required network bandwidth
Testbed Data Logger PC: 2x dual-core Xeon 5130, 4 GB RAM, 3x 3ware RAID5 arrays, 2x GbE NICs
Production Data Logger PC: 2x quad-core Xeon E5520, 24 GB RAM, 3x Adaptec RAID5 arrays, 2x GbE NICs
Performance evaluation: resource utilization
The hard limit on the throughput of a single Data Logger is given by the network bandwidth: 2 Gb/s ≈ 250 MB/s
Old single-threaded implementation
Can operate at network saturation only for a single tag per event
Above 2 tags per event, the load generated by its single thread exceeds what a single CPU core can sustain
The throughput decreases accordingly
Performance evaluation: resource utilization
The hard limit on the throughput of a single Data Logger is given by the network bandwidth: 2 Gb/s ≈ 250 MB/s
New multi-threaded implementation
The throughput is almost unaffected by the load
Its 4 threads spread the workload over the 4 CPU cores: none of them uses more than 60% of a core
This leaves plenty of headroom for additional CPU-intensive processing, e.g. compression
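The quoted hard limit is a plain unit conversion from the two GbE NICs, using decimal prefixes as on the slide:

```python
# 2 Gb/s over the two GbE NICs, converted to MB/s
# (decimal prefixes, 8 bits per byte).
link_bits_per_s = 2 * 10**9
link_mb_per_s = link_bits_per_s / 8 / 10**6
print(link_mb_per_s)  # → 250.0
```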
Performance evaluation: scalability (in testbed)
On-line event compression (with zlib) radically changes the landscape
The time spent compressing events (~50 ms per MB) dominates the rest of the processing (~2 ms per MB):
throughput is much lower
all workloads saturate the CPU
One can therefore examine the throughput as a function of the number of CPU cores (threads) used: scaling is (almost) linear
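With the per-MB timings quoted above, a back-of-the-envelope model shows why compression dominates and why throughput should scale almost linearly with cores. The timings come from the slide; the linear-scaling model itself is the assumption being illustrated, not a measurement.

```python
# Per-MB timings quoted on the slide.
compress_ms_per_mb = 50.0  # zlib compression
other_ms_per_mb = 2.0      # remaining per-event processing

# CPU-bound throughput of a single core, in MB/s (~19 MB/s).
per_core_mb_s = 1000.0 / (compress_ms_per_mb + other_ms_per_mb)

# With an embarrassingly parallel workload, N cores should deliver
# roughly N times that, until some other resource saturates.
estimates = {n: n * per_core_mb_s for n in (1, 2, 3, 4)}
for cores, mb_s in estimates.items():
    print(cores, round(mb_s, 1))
```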
Performance evaluation: scalability (in production)
Conclusions
A novel design for the ATLAS Data Logging application was implemented and thoroughly tested
The performance of the new software is very satisfactory:
it taps into the full power of modern CPUs
it future-proofs the Data Logger
it enables the addition of computationally-intensive features
It will be one of the essential components of the evolved system currently being developed to meet the challenges of LHC data-taking in 2014 and beyond
Backup
Other design constraints
Operations are driven by the received event data:
the Data Logger can only rely on the information it gathers by examining the received events
No assumptions about the data flow:
it cannot be assumed that the rate of received events is somehow balanced across the spectrum of possible tags
the flow of events with a given tag can vary during a run and even stop completely
Other possible designs: thread pool with locking
[Diagram: processing threads get events from the event queue and must ask the Raw File Manager for access to the per-stream, per-luminosity-block file before saving each event.]
Other possible designs: chain of responsibility
[Diagram: events from the event queue are passed along a chain of processing threads, each responsible for a subset of streams.]
Risk of starvation!
Other possible designs: one thread pool per file
[Diagram: each per-stream, per-luminosity-block file has its own pool of processing threads fed from the event queue.]
Too many threads!
zlib performance evaluation