Parallel Computing
Benson Muite
benson.muite@ut.ee
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
3 November 2014
Hadoop, Review
Hadoop
- Hadoop History
- Hadoop Framework
- Fault Tolerance
- Example Applications
Hadoop History
- Motivated by Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proc. OSDI 2004; also Communications of the ACM, 51(1), 107-113 (2008)
- The paper describes Google's original implementation in C++
- Based on this, Hadoop was developed by Doug Cutting (at Yahoo) and Mike Cafarella (then a graduate student at the University of Washington, now faculty at the University of Michigan)
- Hadoop was originally used to support the Nutch search engine project
- Open source because a large number of developers was needed
- Currently an Apache Foundation supported project
- Used by many companies for data analysis
- Hadoop uses Java as its underlying language (efficiency concerns compared to C++, but a wider pool of possible users?)
Hadoop Framework
- Primarily aimed at data intensive computing: large scale databases on commodity hardware
- Slow network, typically Ethernet, but reasonably fast commodity processors
- Distributed file system (Hadoop Distributed File System - HDFS)
- Hadoop 2 is the current release and has some backwards compatibility with Hadoop 1
- Main feature is a map phase, where local work is done without any communication, followed by a reduce phase, where communication is done
- Main idea is that data is stored on local disks, so it need not be moved over a slow network; only the results of a query are moved after the map phase
- For many problems, query results contain much less data than the input, so large data sets can be queried interactively over a low bandwidth, high latency network, as in the word count sketch below
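To make the two phases concrete, here is the classic word count job, a minimal sketch following the standard example in the Hadoop MapReduce tutorial (Hadoop 2 Java API); input and output paths are taken from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map phase: runs locally on each data block, no communication;
      // emits (word, 1) for every word seen.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      // Reduce phase: counts for each word are gathered over the
      // network (the shuffle) and summed.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          result.set(sum);
          context.write(key, result);
        }
      }
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-summing before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, this would typically be submitted with something like hadoop jar wordcount.jar WordCount /input/path /output/path, with both paths in HDFS (paths here are illustrative).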
Fault Tolerance
- Data is typically replicated three times: two copies are kept near the primary storage location to minimize access times, and a third copy is placed on a separate disk
- The system has a task scheduler (in Hadoop 1 this is not replicated, but in Hadoop 2 it is replicated to prevent system crashes) and work coordinators which report back to the task scheduler
- Should a part of the system fail, work is rescheduled
- There is also a lookup table or set of lookup tables
- The scheduling policy needs to allocate work so that all jobs in an organization run efficiently; this is a tough problem to solve exactly, so heuristics are typically used
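As a sketch of how the replication factor is controlled in practice: dfs.replication is the standard HDFS property for the default number of block copies, and it can also be changed per file through the Java API (the file path below is purely illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default number of copies of each block (3 is the HDFS default).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // Replication can also be adjusted for an existing file;
        // the path here is a made-up example.
        fs.setReplication(new Path("/user/data/input.txt"), (short) 2);
        fs.close();
      }
    }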
Example Applications
- Search engines
- Examining computer internet access logs
- Online store purchase recommendations
- Credit reports
- Airline flight scheduling
- Consumer spending pattern analysis
- Machine translation
- Machine learning
Ways to Get Involved
- Part of an open source ecosystem
- Many developers from around the world
- Many extensions being proposed using the same model
- Targeted towards commodity hardware
- Some good redundancy ideas that may be useful for future exascale compute models
Review
Supercomputer Rankings
- Top 500
- Green 500
- Graph 500
- Green Graph 500
Parallel Programming APIs
- OpenMP
- MPI
- OpenCL
- CAF (Coarray Fortran)
- UPC (Unified Parallel C)
- Many others, but the ones listed are mainstream at the moment
Performance Measurement
- How long does your code take to run?
- How well does your code utilize the hardware it is running on?
- Amdahl's law
- Gustafson's law
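For reference, with f the fraction of the work that can be done in parallel and p the number of processors, the two laws read:

    S_{\mathrm{Amdahl}}(p) = \frac{1}{(1-f) + f/p}, \qquad S_{\mathrm{Gustafson}}(p) = (1-f) + f\,p

Amdahl's law bounds the speedup of a fixed-size problem (at most 1/(1-f) as p grows); Gustafson's law describes the scaled speedup when the problem size grows with p.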
Roofline Model
- Upper bound for performance of your code or important kernels in your code
- Determine whether an algorithm is RAM bandwidth bound or compute bound at the node level
- Can extend the ideas to the system level by using interprocessor bandwidth rather than bandwidth from RAM
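In symbols (a standard statement of the model): for a kernel with arithmetic intensity I flops per byte moved from RAM, on a node with peak floating point rate P_peak and memory bandwidth B,

    P_{\text{attainable}} = \min\left( P_{\mathrm{peak}},\; B \cdot I \right)

Kernels with I < P_peak / B are bandwidth bound; the rest are compute bound.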
Measures of Efficiency
- Speedup
- Parallel efficiency
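With T(p) the runtime on p processes, these are defined by:

    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

Ideal scaling gives S(p) = p and E(p) = 1.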
Computer Chip Architecture
- Memory hierarchy: registers, L1 cache, L2 cache, L3 cache, RAM, disk
- Floating point units, integer units, instruction units, vector units
- Fused multiply add, prefetching, cores, NUMA
- Clock speed
Computer Interconnect Topologies
- Bus
- Ring
- Mesh
- All-to-all (fully connected)
- Hypercube
- 2D, 3D, 4D, 5D, 6D tori
- Fat tree
- Clos
- Dragonfly
Algorithm Operation Counts
- Dot product
- Matrix multiplication
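As a reminder of the counts, for vectors of length n and n × n matrices:

    x \cdot y = \sum_{i=1}^{n} x_i y_i : \quad n \text{ multiplications and } n-1 \text{ additions, so } 2n-1 \text{ flops}

    c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} : \quad 2n-1 \text{ flops per entry over } n^2 \text{ entries, so } n^2(2n-1) \approx 2n^3 \text{ flops}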
Algorithm Runtime Performance Estimation
- Data processing: functional operation count divided by (operations per cycle × cycles per second)
- Data movement, theory: reduction and all-to-all on different simple network topologies
- Data movement, in practice: empirical models needed due to the complexity of current networks
- Data movement, theory: switch
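A common first order model combining these, with α the network latency, β the bandwidth, and m the message size:

    T_{\mathrm{comp}} = \frac{\text{operation count}}{\text{operations per cycle} \times \text{cycles per second}}, \qquad T_{\mathrm{comm}} = \alpha + \frac{m}{\beta}

For example, a dot product of length n on one core sustaining 4 flops per cycle at 2 GHz takes roughly (2n-1)/(8 × 10^9) seconds, assuming memory can keep up.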
Single Core Optimization
- In many cases where parallelization is being done to reduce time to solution, check whether re-optimizing your code first will help
- Many people who re-program for the Xeon Phi make their codes faster on the regular Xeon chip as well, since the restructuring makes it easier for the chip to execute instructions in order
Accelerators
- Good for floating point computations
- High energy efficiency
- Typically very many small processing units which do only a few things very efficiently (e.g. floating point operations)
- Very simple instruction scheduling
- Examples: Xeon Phi, GPUs (Graphics Processing Units), MPPAs (Massively Parallel Processor Arrays), FPGAs (Field Programmable Gate Arrays)
Vectorization
- Lowest level of parallelization
- Single instruction, multiple data (SIMD)
- Heavily used in efficient floating point architectures such as the Xeon Phi
- Requires regular memory accesses, as in the loop sketched below
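A sketch of a vectorizable loop, written here in Java (the same pattern applies in C or Fortran, where compilers report vectorization; whether the Java JIT actually emits SIMD instructions depends on the platform):

    public class Saxpy {
      // y = a*x + y over contiguous arrays: each iteration is independent
      // and accesses memory with stride one, so the loop body maps
      // naturally onto SIMD hardware.
      static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
          y[i] = a * x[i] + y[i];  // also a fused multiply add candidate
        }
      }
      public static void main(String[] args) {
        float[] x = new float[1024], y = new float[1024];
        java.util.Arrays.fill(x, 1.0f);
        saxpy(2.0f, x, y);
        System.out.println(y[0]);  // prints 2.0
      }
    }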
Loop Parallelization
- Can be coarse grain or fine grain
- For many tasks, fine grain parallelism needs to be carefully implemented to avoid excessive overhead costs
- Usually the easiest form of parallelism to add to a code (see the sketch below)
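A minimal loop parallelization example in Java using parallel streams (analogous to an OpenMP parallel for); the runtime splits the index range across worker threads:

    import java.util.stream.IntStream;

    public class ParallelLoop {
      public static void main(String[] args) {
        double[] a = new double[1_000_000];
        // Each iteration is independent, so the index range can be
        // distributed over the available cores.
        IntStream.range(0, a.length).parallel()
                 .forEach(i -> a[i] = Math.sqrt(i));
        System.out.println(a[a.length - 1]);
      }
    }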
Task Parallelization
- Often used in information processing applications
- Need to schedule appropriately to ensure available resources are used
- Need to ensure all data is available
DAG Scheduling
- Method to expose parallelism by determining task dependencies
- Heavily used in information processing applications
- Being applied to massively concurrent architectures in high performance computing
- Need to be careful in implementation to avoid high memory costs from generating the graph
- Just in time dependency generation is one way of reducing memory requirements (see the sketch below)
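A small sketch of expressing task dependencies in Java with CompletableFuture (the tasks a, b, c are illustrative): the runtime schedules each task as soon as its inputs are ready, which is exactly an edge of the dependency DAG:

    import java.util.concurrent.CompletableFuture;

    public class TaskDag {
      public static void main(String[] args) {
        // Independent tasks a and b can run concurrently.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 2 + 3);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> 4 * 5);
        // Task c depends on both a and b and is scheduled only
        // when both have finished.
        CompletableFuture<Integer> c = a.thenCombine(b, (x, y) -> x + y);
        System.out.println(c.join());  // prints 25
      }
    }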
Load Balancing
- Need to be careful in how work is partitioned
- Want all processes to be kept busy during parallel execution
- Static load balancing can be done easily at startup time if the program allows for this
- Dynamic load balancing is more difficult; it typically arises in adaptive solution of models for physical processes (a simple work queue sketch follows)
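A sketch of dynamic load balancing via a shared work queue in Java: a fixed pool of threads pulls tasks of uneven cost, so fast workers automatically pick up more work (the task sizes here are made up):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class DynamicBalance {
      public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int task = 0; task < 100; task++) {
          final int n = task;
          // Tasks have deliberately uneven cost; the shared queue
          // balances them across the four workers at runtime.
          pool.submit(() -> {
            long work = (n % 10) * 1_000_000L;
            double s = 0;
            for (long i = 0; i < work; i++) s += Math.sin(i);
            System.out.println("task " + n + " -> " + s);
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
      }
    }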
Sorting
- Many ways to do this in parallel; examples include quick sort and merge sort
- Typically does not require many floating point operations, instead doing comparisons
- Used in bioinformatics, particle based algorithms, etc. (a library level example follows)
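In Java, a parallel merge sort is available directly in the standard library; a minimal usage sketch:

    import java.util.Arrays;
    import java.util.Random;

    public class ParallelSortDemo {
      public static void main(String[] args) {
        int[] data = new Random(42).ints(10_000_000).toArray();
        // Arrays.parallelSort splits the array, sorts the pieces on
        // the common fork/join pool, and merges them.
        Arrays.parallelSort(data);
        System.out.println(data[0] + " ... " + data[data.length - 1]);
      }
    }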
I/O
- Usually the slowest operation
- File systems are complicated
- For many scientific computing applications, try to minimize this
- For many data applications, this is the most important part of the problem
Hadoop and MapReduce
- Parallel programming model, primarily aimed at data processing
- A few commands allow easy buildup of parallel programs
Other Topics
- Pipelining
- Vector processors (NEC SX-ACE)
- Non-traditional uses of supercomputing: text analysis, economic planning
References
- http://hadoop.apache.org/docs/current/
- https://en.wikipedia.org/wiki/Apache_Hadoop
- Herodotou, H. Automatic Tuning of Data-Intensive Analytical Workloads, PhD Thesis, Duke University (2012), http://www.cs.duke.edu/~hero/research.html
- Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters, Proc. OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, USA, December 2004; also Communications of the ACM, 51(1), 107-113 (2008)
References
- http://www.drdobbs.com/database/hadoop-tutorial-series/240155055
- Wadkar, S., Siddalingaiah, M. and Venner, J. Pro Apache Hadoop, 2nd Ed., Apress (2014)
- Shenoy, A. Hadoop Explained, Packt Publishing (2014)
- Nielsen, L. Hadoop for Laymen, Newstreet Communications (2014)
- Lockwood, G.K. Tutorials in Data-Intensive Computing, http://www.glennklockwood.com/di/index.php (2014)
Acknowledgements
Myroslava Stavnycha