HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

Size: px

Start display at page:

Download "HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect [email protected]"

Adrian Young
10 years ago
Views:

1 HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect

2 EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training EPCC is the HPC Centre of the University of Edinburgh Vital statistics: ~75 staff ~ 4M turnover from external sources Multidisciplinary and multi-funded with a large spectrum of activities and a critical mass of expertise Supports research through: Access to facilities Training courses Visitor programmes Collaborative projects HPC and Big Data 2

Multidisciplinary and multi-funded with a large spectrum of activities and a critical mass of expertise

3 HPC Big compute Scientific simulation Third pillar of science Explore universe through simulation rather than experimentation Test theories Predict or validate experiments Simulate untestable science Reproduce real world in computers Generally simplified Dimensions and timescales restricted Simulation of scientific problem or environment Input of real data Output of simulated data Parameter space studies Wide range of approaches HPC and Big Data 3

in computers Generally simplified Dimensions and timescales restricted Simulation of scientific problem or

4 Performance Trend FLOPS Yotta: Zetta: Exa: Peta: Tera: Giga: 10 9 Mega: 10 6 Kilo: 10 3 This graph is borrowed from Wikipedia Lucas wilkins HPC and Big Data 4

5 Performance Trend HPC and Big Data 5

6 Key challenges Scale Ensuring program can utilise resources Decompose problem over processes Overheads Communication costs Synchronisation Serial parts I/O Utilisation Ensuring resources load-balanced Ensuring machines fully utilised HPC and Big Data 6

Synchronisation Serial parts I/O Utilisation Ensuring

7 Data Intensive Computing Large amounts of data to be processed Low computing requirements Independent tasks Key challenges Distributing data to compute Minimise data movements Hadoop, MapReduce, HPCC SciDB Fault tolerance/reliability HPC and Big Data 7

Distributing data to compute Minimise data movements Hadoop,

8 Big data Worldwide LHC Computing Grid Distribute and manage LHC data; 25 PB per year Computing resource requirement too large of one site Grid technology driver OGSA-DAI: Distributed data access and management Federate and access resources (e.g. relational or XML databases, files or web services) via web services on the web or within grids or clouds. Query, update, transform, and combined data Enable user to focus on application-specific data analysis and processing. HPC and Big Data 8

9 Data Projects ADMIRE: Architecture for Data Intensive Research single platform for knowledge discovery: data access, integration, pre-processing, data mining, statistical analysis, post-processing, transformation DISPEL, Java-like language for describing complex data-intensive workflows Streaming execution engine to remove data bottlenecks Visual programming tools based on the eclipse platform Library of common workflows and components HPC and Big Data 9

post-processing, transformation DISPEL, Java-like language for describing complex data-intensive workflows Streaming

10 Example: Oncology Aim: investigate genetic causes of bowel cancer Collaborative project between EPCC and the Colon Cancer Genetics Group (CCGG) Vast amount of data Over 500,000 genetic markers from 2000 people Two-stage study Stage 1: investigated effect of each individual marker Required ~565,000 computations, O(N) problem Predicted serial runtime ~4 months on a single cpu Parallel code took 6.5 hours on 128 processors (BlueGene/L) ( HPC and Big Data 10

investigated effect of each individual marker Required ~565,000 computations, O(N) problem Predicted serial runtime ~4

11 Example: Oncology Stage 2: investigated interactions between the gene markers Every pair of markers must be tested O(N 2 ) problem 565,000 x 565,000 2 = 1.5 billion gene interactions! Key challenges: runtime, memory, scaling & sorting HPC and Big Data 11

problem 565,000 x 565,000 2 = 1.5 billion gene interactions!

12 Oncology Runtime code expected to take 400 days, optimisation reduced this to 130 days but still too long Need a parallel code Memory Impossible to fit all the data into memory However, we only actually need 5% of the results Scaling 2D decomposition used with a task farm More chunks than processors Sorting Parallel sorting algorithm used Computed interactions between all pairs of markers 565,000 2 computations Runtime reduced from 400 days to 5 hours on 512 CPUs on HECToR 8.5x10 9 (192GB) probability values obtained Sorting performed in 5 minutes HPC and Big Data 12

chunks than processors Sorting Parallel sorting algorithm used Computed interactions between all pairs of markers 565,000 2 computations

13 Square Kilometre Array (SKA) Largest and most sensitive radio telescope in the world to be built in South Africa and Australia 3000 dishes 1 EB data generated per day HPC and Big Data 13

14 Facilities: EDIM1 A machine for Data Intensive Research Commissioned by EPCC & Informatics Designed for I/O-intensive applications Use commodity components Combine them in a novel way Use cheap low-power processors HPC and Big Data 14

I/O-intensive applications Use commodity components

15 Facilities: EDIM1 EDIM1 120 nodes Dual-Core Intel 1.6 GHz ATOM processor NVIDIA ION GPU 1 x 256 MB SSD 3 x 2TB HDD Data staging node for hot-plugging SATA hard disks for direct data upload SDSC Gordon Purpose built DIR machine 1024 compute nodes: 2 x 8-core Intel processors, 64 GB memory 64 I/O nodes: Gb Ethernet 2 x 6-core Intel processors, 48 GB memory, 16 x 300 GB SSD USB2 HPC and Big Data 15

SATA hard disks for direct data upload SDSC Gordon Purpose built DIR machine 1024 compute nodes: 2

16 Facilities: DiRAC and Indy Indy: Linux and Windows HPC cluster 1536 cores 24 nodes: 64 cores, 256GB 128 cores 2 nodes: 64 cores, 512GB Commercial usage focus No job length or queue restrictions DiRAC: BlueGene/Q 6144 compute nodes compute cores 1.26PFlop/s HPC and Big Data 16

Commercial usage focus No job length or queue restrictions DiRAC:

17 Facilities: HECToR UK National HPC Service Currently 30- cabinet Cray XE6 system 90,112 cores Each node has 2 16-core AMD Opterons (2.3GHz Interlagos) 32 GB memory Peak of over 800 TF and 90 TB of memory HPC and Big Data 17

18 Cray XE6 Layout Compute nodes Login nodes Lustre OSS Lustre MDS NFS Server Boot/SDB node 1 GigE Backbone Cray XE6 Supercomputer Infiniband Switch 10 GigE Backup and Archive Servers Lustre high-performance, parallel filesystem HPC and Big Data

Supercomputer Infiniband Switch 10 GigE Backup and Archive

19 Filesystem High performance file system Lustre: /work Smaller space for /home Filesystem globally access /work from backend /home from frontend Connected over compute network HPC and Big Data

20 UK Research Data Facility RDF consists of 7.8PB disk 19.5 PB backup tape Provide a high capacity robust file store; Persistent infrastructure - will last beyond any one national service; Will remove end of service data issues - transfers at end of services have become increasingly lengthy; Will also ensure that data from the current HECToR service is secured - this will ensure a degree of soft landing if there is ever a gap in National Services; RDF is designed for long term data storage Currently only open to HECToR users HPC and Big Data 20

remove end of service data issues - transfers at end of services have become increasingly lengthy; Will also ensure that data from the

21 Facilities: GPGPU Test-bed Evaluate how your application performs compared to traditional CPU or develop your application to run on GPGPUs GPGPUs NVIDIA Fermi C2050 NVIDIA Fermi C2070 AMD FireStream 9270 NVIDIA K20 HPC and Big Data 21

22 Contact details EPCC, The University of Edinburgh JCMB, Mayfield Road, Edinburgh EH9 3JZ HPC and Big Data 22

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory