Branch-and-Bound with Peer-to-Peer for Large-Scale Grids
Alexandre di Costanzo
INRIA - CNRS - I3S - Université de Sophia Antipolis
Ph.D. defense, Friday 12th October 2007
Advisor: Prof. Denis Caromel
The Big Picture
  Objective
    Solving combinatorial optimization problems with Grids
  Approach
    Parallel Branch-and-Bound and Peer-to-Peer
  Contributions
    1. Branch-and-Bound framework for Grids
    2. Peer-to-Peer infrastructure for Grids
    3. Large-scale experiments
Agenda
  Context, Problem, and Related Work
  Contributions
    Branch-and-Bound Framework for Grids
    Desktop Grid with Peer-to-Peer
    Mixing Desktops & Clusters
  Perspectives & Conclusion
Context
  Combinatorial Optimization Problems (COPs)
    costly to solve (finding the best solution)
  Branch-and-Bound (BnB)
    well adapted for solving COPs
    relatively easy to provide a parallel version
  Grid Computing
    large pool of resources
    large-scale environment
Branch-and-Bound
  Consists of a partial enumeration of all feasible solutions and returns the guaranteed optimal solution
  Feasible solutions are organized as a tree: the search-tree
  3 operations:
    Branching: split into sub-problems
    Bounding: compute lower/upper bounds (objective function)
    Pruning: eliminate bad branches
  Well adapted for solving COPs [Papadimitriou 98]
    returns the best combination out of all
  [Figure: search-tree. Legend: Branching: split into sub-problems; Bounding: compute local lower/upper bounds; Pruning: local lower bound higher than the global one; parts of the tree that are neither generated nor explored]
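The three operations can be illustrated on a toy minimization problem. The sketch below is not taken from the slides: it assumes a small assignment problem (one column per row of a cost matrix), a depth-first traversal, and a simple lower bound (each remaining row takes its cheapest free column). The names `lower_bound` and `branch_and_bound` are hypothetical.

```python
def lower_bound(cost, used_cols, row, cost_so_far):
    """Bounding: optimistic estimate where every remaining row
    takes its cheapest column among those still free."""
    n = len(cost)
    free = [c for c in range(n) if c not in used_cols]
    return cost_so_far + sum(min(cost[r][c] for c in free) for r in range(row, n))


def branch_and_bound(cost):
    """Return (best cost, best assignment, number of explored nodes)."""
    n = len(cost)
    best_cost = float("inf")   # global bound: best solution found so far
    best_assignment = None
    explored = 0
    stack = [(0, [], 0)]       # (next row, columns chosen so far, cost so far)
    while stack:
        row, cols, so_far = stack.pop()
        explored += 1
        if row == n:           # leaf: a complete feasible solution
            if so_far < best_cost:
                best_cost, best_assignment = so_far, cols
            continue
        # Pruning: the local lower bound cannot beat the global bound
        if lower_bound(cost, cols, row, so_far) >= best_cost:
            continue
        # Branching: one sub-problem per free column for this row
        for c in range(n):
            if c not in cols:
                stack.append((row + 1, cols + [c], so_far + cost[row][c]))
    return best_cost, best_assignment, explored


# Hypothetical 4x4 cost matrix used only for illustration
COST = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]
best, assignment, explored = branch_and_bound(COST)
print(best, assignment, explored)
```

The full tree for this 4x4 instance has 65 nodes; the count of explored nodes shows how much of it pruning avoids.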
Parallel Branch-and-Bound
  COPs are difficult to solve: enumeration size & NP-hard class
  Many studies on parallel approaches [Gendron 94, Crainic 06]
    node-based: parallel bounding on sub-problems
    tree-based: building the tree in parallel
    multi-search: several trees are generated in parallel
  Tree-based is the most studied [Authié 95, Crainic 06]
    the solution tree is not known beforehand
    no part of the tree may be estimated at compilation
    tasks are dynamically generated
    task allocations to processors must be done dynamically
    distribution issues, such as load-balancing & information sharing
  Sharing a global bound for optimizing the prune operation
  Proposition: Parallel BnB + Grid
    tree-based is the approach best known by users & the related difficulties are known
BnB Related Work
  Frameworks | Algorithms | Parallelization | Machines
  PUBB | Low-level, Basic B&B | SPMD | /PVM
  BOB++ | Low-level, Basic B&B | SPMD | /MPI
  PPBB-Lib | Basic B&B | SPMD | /PVM
  PICO | Basic B&B, Mixed-integer LP | hier. master-worker | /MPI
  MallBa | Low-level, Basic B&B | SPMD | /MPI
  ZRAM | Low-level, Basic B&B | SPMD | /PVM
  ALPS/BiCePS | Low-level, Basic B&B, Mixed-integer LP, Branch&Price&Cut | hier. master-worker | /MPI
  Metacomputing MW | Basic B&B | master-worker | Grids/Condor
  Symphony | Mixed-integer LP, Branch&Price&Cut | master-worker | /PVM
  Most previous work is based on SPMD and targets clusters
    better for sharing bounds
Grid Computing
  Distributed shared computing infrastructure
    multi-institutional virtual organization
  Provides a large pool of resources
  New challenges
    geographically distributed
    deployment
    scalability
    communication
    fault-tolerance
    multiple administrative domains, heterogeneity, high performance, programming model, etc.
  Parallel BnB related work: SPMD
    Grids are not adapted for SPMD (heterogeneity, latency, etc.)
Grid Computing
  Grids involve new concepts for development & execution of applications
  [Figure: Grid layers, from top to bottom]
    Grid Applications & Portals
    Grid Programming: Models, Tools, High-Level Access to Middleware (Branch-and-Bound API)
    Grid Middleware Infrastructure: Super-Schedulers, Resource Trading, Information, Security, etc.
    Grid Fabric: Schedulers, Networking, OS; Federated Hardware Resources
Challenges
  Combinatorial Optimization Problems
    costly to solve
  Branch-and-Bound
    adapted for solving COPs
    easy to parallelize
  + Grid Computing
    huge number of resources
  Use Branch-and-Bound on Grids for solving COPs
  Efficient communication on Grids is difficult
    problem with sharing bounds
BnB for Grids Related Work
  Aida et al.
    focus on the design: hierarchical master-worker scales on Grids
  Foster et al.
    fully decentralized approach: communication overhead
  ParadisEO
    focus on meta-heuristics (not exact solution): master-worker
  Skeletons with farm or divide-and-conquer
  Satin for divide-and-conquer
BnB on Grids: Problems and Solutions
  Problem | Solution
  Latency | Asynchronous communications
  Scalability | Hierarchical master-worker
  Solution tree size | Dynamically generated by splitting tasks
  Share the best bounds | Efficient parallelism & communication
  Faults | Fault-tolerance
  Objective: hide Grid difficulties from users
    especially communication problems
BnB Framework - Entities and Roles
  Root Task: implemented by users
    objective function and splitting/branching operation
  Master: entry point
    splits the problem into tasks
    collects partial results
    the best solution
  Sub-Master: intermediary between master and workers
  Worker: computes tasks
BnB Framework - Architecture
  [Figure, animated: Master and Submasters. Labels in order of appearance: Problem; First splitting; Pending tasks; Ask task; Got task; Computing; Sending result; Partial results]
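The task flow named by the figure labels (first splitting, pending tasks, ask task, got task, computing, sending result, partial results) can be sketched with threads standing in for remote workers. This is only an illustration on a single machine, without sub-masters; `run_master`, `split_range` and `solve_chunk` are hypothetical names, not part of the framework's API.

```python
import queue
import threading


def run_master(problem, split, solve, n_workers=4):
    """Split the problem, let workers pull tasks, return the best partial result."""
    pending = queue.Queue()
    for task in split(problem):           # first splitting -> pending tasks
        pending.put(task)

    partial_results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()   # ask task / got task
            except queue.Empty:
                return                        # no task left: the worker stops
            result = solve(task)              # computing
            with lock:
                partial_results.append(result)  # sending result

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return min(partial_results)           # best solution among partial results


def split_range(problem, chunk=10):
    """Toy splitting: cut a range of candidates into chunks."""
    lo, hi = problem
    for start in range(lo, hi, chunk):
        yield range(start, min(start + chunk, hi))


def solve_chunk(task):
    """Toy objective: minimize (x - 37)^2 over the chunk; returns (value, x)."""
    return min(((x - 37) ** 2, x) for x in task)


best_result = run_master((0, 100), split_range, solve_chunk)
print(best_result)
```

Workers pull tasks whenever they are free, so faster workers simply take more tasks.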
BnB Framework - Solutions
  Context
    ProActive Java Grid middleware:
      latency: asynchronous communication
      underlying Grid infrastructure: deployment framework (abstraction)
    Implement the tree-based parallel approach
    Master-worker architecture
  Problem: workers need to share bounds
    difficult to adapt SPMD for Grids (heterogeneity, distribution, etc.)
  Solution 1: Master keeps the bound
    - previous work shows that it does not scale [Aida 2003]
  Solution 2: Message framework (Enterprise Service Bus)
    - Grid middleware dependent / good for SOA
  Solution 3: Broadcasting
    - 1-to-n communication cannot scale
    hierarchical broadcasting scales [Baduel 05]
Organizing Communications for Broadcasting
  Idea: Grids are composed of clusters
    organizing Workers in groups
    clusters are high-performance communication environments
  Solution: add a new entity for organizing communications: the Leader
    the Leader is a Worker chosen by the Master for each group
  Process to update Bounds:
    1. the Worker broadcasts the new Bound inside its group
    2. the group Leader broadcasts the new Bound to all Leaders
    3. each Leader broadcasts the new value inside its group
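The three-step update process can be simulated to count how many messages stay inside a group and how many cross group boundaries. The sketch below is a plain in-memory model, not ProActive code. It assumes that the Leader of the originating group does not re-broadcast inside its own group, since the Worker already did so in step 1. All names (`share_bound`, `groups`, `leaders`) are hypothetical.

```python
def share_bound(groups, leaders, bounds, origin, new_bound):
    """Propagate new_bound from worker `origin`.

    Returns (intra_group_messages, inter_group_messages).
    """
    origin_group = next(g for g, members in groups.items() if origin in members)
    intra = inter = 0
    bounds[origin] = min(bounds[origin], new_bound)

    # Step 1: the worker broadcasts the new bound inside its own group
    for w in groups[origin_group]:
        if w != origin:
            bounds[w] = min(bounds[w], new_bound)
            intra += 1

    for g, leader in leaders.items():
        if g == origin_group:
            continue
        # Step 2: the origin group's leader sends the bound to every other leader
        bounds[leader] = min(bounds[leader], new_bound)
        inter += 1
        # Step 3: each leader broadcasts the value inside its own group
        for w in groups[g]:
            if w != leader:
                bounds[w] = min(bounds[w], new_bound)
                intra += 1
    return intra, inter


# Hypothetical Grid: 3 clusters of 4 workers, first worker of each is the leader
groups = {c: [f"{c}-w{i}" for i in range(4)] for c in ("c1", "c2", "c3")}
leaders = {c: members[0] for c, members in groups.items()}
bounds = {w: float("inf") for members in groups.values() for w in members}

intra, inter = share_bound(groups, leaders, bounds, "c2-w3", 42)
flat_inter = sum(len(m) for g, m in groups.items() if g != "c2")
print(intra, inter, flat_inter)
```

With this layout, a flat 1-to-n broadcast from the same worker would send 8 messages across clusters, against 2 with Leaders; the remaining traffic stays inside clusters, where communication is fast.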
BnB framework - Communications for Sharing Bound
  [Figure, animated: Submasters and groups, each group with a Leader; a "new best bound!" is propagated through the Leaders]
  Grid middleware provides communications
BnB Search Strategies
  The improvement of the Bound
    1. depends on how it is shared (communication)
    2. depends on how the search-tree is generated:
      Classical
        Depth-First Search
        Breadth-First Search
      Contribution
        First-In-First-Out (FIFO)
        Priority
        open API...
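A search strategy amounts to choosing which pending node to expand next. The sketch below shows this for three strategies on a tiny hand-made tree; it is an illustration only and does not reproduce the framework's open API. The slide does not detail how the framework's FIFO strategy differs from breadth-first search, so FIFO is not modelled here. The tree, the bound values and the function name `visit_order` are assumptions.

```python
import heapq
from collections import deque


def visit_order(root, children, strategy, bound=None):
    """Return the order in which nodes are expanded under a given strategy."""
    order = []
    if strategy == "depth-first":
        frontier = [root]                       # stack: last in, first out
        while frontier:
            node = frontier.pop()
            order.append(node)
            frontier.extend(reversed(children.get(node, [])))
    elif strategy == "breadth-first":
        frontier = deque([root])                # queue: level by level
        while frontier:
            node = frontier.popleft()
            order.append(node)
            frontier.extend(children.get(node, []))
    elif strategy == "priority":
        frontier = [(bound[root], root)]        # heap: smallest bound first
        while frontier:
            _, node = heapq.heappop(frontier)
            order.append(node)
            for child in children.get(node, []):
                heapq.heappush(frontier, (bound[child], child))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order


# Hypothetical search-tree and lower bounds
TREE = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
BOUND = {"r": 0, "a": 5, "b": 3, "a1": 6, "a2": 9, "b1": 4, "b2": 7}

for s in ("depth-first", "breadth-first", "priority"):
    print(s, visit_order("r", TREE, s, BOUND))
```

Depth-first reaches leaves (complete solutions) early, which tends to produce a first bound quickly; the priority order expands the most promising nodes first.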
BnB Framework: Fault-Tolerance
  Manage user exceptions: computation stopped
  For us, a fault is a fail-stop
    Worker fault: handled by the (sub-)master
    Leader fault: the master chooses a new one
    Sub-master fault: the manager changes a worker into a sub-master
    Master fault: back-up file
  Load-balancing is natural with master-worker
    the framework provides a function to get the number of free Workers
    users use it to decide branching
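The slide states that a Worker fault is handled by the (sub-)master without saying how. One common policy, assumed here for illustration, is that the master remembers which task each worker holds and puts the task back in the pending queue when the worker stops. The class and method names are hypothetical.

```python
from collections import deque


class Master:
    """Toy master that re-queues the task of a failed worker (assumed policy)."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.running = {}      # worker name -> task currently assigned
        self.results = []

    def ask_task(self, worker):
        """Give the next pending task to a worker, or None if none is left."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker] = task
        return task

    def send_result(self, worker, result):
        """Record a partial result and free the worker."""
        del self.running[worker]
        self.results.append(result)

    def worker_failed(self, worker):
        """Fail-stop of a worker: its task goes back to the front of the queue."""
        task = self.running.pop(worker, None)
        if task is not None:
            self.pending.appendleft(task)


master = Master(["t1", "t2", "t3"])
master.ask_task("w1")               # w1 holds t1
master.ask_task("w2")               # w2 holds t2
master.worker_failed("w1")          # t1 is pending again
master.send_result("w2", "t2 done")
retry = master.ask_task("w2")       # w2 picks up the task lost by w1
print(retry, list(master.pending))
```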
Grid'BnB [HiPC07] Features
  Design
    Asynchronous communications
    Hierarchical master-worker with com.
    Dynamic task splitting
    Efficient communications with groups
    Fault-tolerance
  Users
    Hidden parallelism and Grid difficulties
    API for COPs
    Ease of deployment
    Principally tree-based
    Implementing and testing search strategies
    Focus on objective function
  Validate and test Grid'BnB by experiments
Flow-Shop Experiments
  NP-hard permutation optimization problem
    J = {j1, j2, ..., jn}
    ji = {oi1, oi2, ..., oim}
    M = {m1, m2, ..., mm}
  [Figure: schedule of operations o1,1 ... o4,4 on machines m1 ... m4; horizontal axis: execution time]
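For a permutation flow-shop, every job visits the machines in the same order and all machines process the jobs in the same sequence. The sketch below computes the total execution time (makespan) of a given job order, which is the usual objective for this problem; the slide does not name the objective explicitly, so this is an assumption, as are the function names and the toy data.

```python
from itertools import permutations


def makespan(processing_times, order):
    """Completion time of the last job on the last machine.

    processing_times[j][k] is the duration of job j on machine k.
    """
    n_machines = len(processing_times[0])
    finish = [0] * n_machines        # finish[k]: when machine k becomes free
    for job in order:
        for k in range(n_machines):
            # A job starts on machine k once the machine is free and
            # the job has left machine k-1.
            ready = finish[k - 1] if k > 0 else 0
            finish[k] = max(finish[k], ready) + processing_times[job][k]
    return finish[-1]


def best_order_by_enumeration(processing_times):
    """Exhaustive search, only usable for tiny instances."""
    jobs = range(len(processing_times))
    return min(permutations(jobs), key=lambda o: makespan(processing_times, o))


# Hypothetical instance: 2 jobs, 2 machines
P = [[3, 2],
     [1, 4]]
print(makespan(P, (0, 1)), makespan(P, (1, 0)), best_order_by_enumeration(P))
```

Exhaustive enumeration grows as n! in the number of jobs, which is why instances such as 17 jobs on 17 machines call for Branch-and-Bound rather than brute force.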
Single Cluster: Search Strategies
  [Figure: Flow-Shop, 16 Jobs / 20 Machines. Time in minutes (0 to 250) versus number of CPUs (20 to 60) for four search strategies: FIFO, Depth-first search, Breadth-first search, Priority]
Single Cluster: Communications
  [Figure: Flow-Shop, 16 Jobs / 20 Machines. Time in minutes (0 to 80) versus number of CPUs (20 to 60), with and without group communications; secondary axis: ratio factor (With Com / No Com), 1 to 3]
Grid'5000: the French Grid
  Sites: Lille, Rennes, Orsay, Nancy, Bordeaux, Lyon, Grenoble, Toulouse, Sophia
  9 sites - 14 clusters - 3586 CPUs
Grid Experimentations: up to 621 CPUs on 5 sites
  [Figure: Flow-Shop, 17 Jobs / 17 Machines. Execution time in minutes (0 to 120) versus number of CPUs (0 to 700)]
  [Figure: Flow-Shop, 17 Jobs / 17 Machines. Execution time and total work (% of explored search tree, 0 to 0.3) versus number of CPUs (0 to 700)]
  Speedup anomalies [Lai 84, Li 86]
Speedup Anomalies & Efficiency
  Parallel tree-based speedup may sometimes be quite spectacular (> or < linear) [Mans 95]
  Speedup anomalies in BnB [Roucairol 87, Lai 84, Li 86]
    speedup depends on how the tree is dynamically built
  Efficiency: a related measure computed as the speedup divided by the number of processors
  [Figure: Flow-Shop, 17 Jobs / 17 Machines. Efficiency (0 to 1.2) versus number of CPUs (90 to 640)]
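The efficiency definition above translates directly into two small functions. The timings in the example are invented to show what an anomaly looks like (efficiency above 1 when the parallel run explores less of the tree); they are not measurements from the experiments.

```python
def speedup(t_reference, t_parallel):
    """Ratio of the reference execution time to the parallel execution time."""
    return t_reference / t_parallel


def efficiency(t_reference, t_parallel, n_processors):
    """Speedup divided by the number of processors."""
    return speedup(t_reference, t_parallel) / n_processors


# Hypothetical timings in minutes
linear = efficiency(t_reference=100, t_parallel=12.5, n_processors=8)
super_linear = efficiency(t_reference=100, t_parallel=10, n_processors=8)
print(linear, super_linear)
```

An efficiency of 1 corresponds to linear speedup; values above or below 1 correspond to the anomalies cited on the slide.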
Grid'BnB: Results
  Experimental validation of our BnB framework for Grids
    validity of organizing communications
    scalability on Grid (up to 621 CPUs on 5 sites)
  Problems:
    deployment on Grids is difficult
    dynamically acquiring new resources is difficult
    popularity of Grid'5000
    cannot mix Grid'5000 and under-utilized lab desktops
  Grid middleware needs better support for dynamic infrastructure
Agenda Context, Problem, and Related Work Contributions Branch-and-Bound Framework for Grids Desktop Grid with Peer-to-Peer Mixing Desktops & Clusters Perspectives & Conclusion 29
Peer-to-Peer as Grid Middleware Grid Computing and Peer-to-Peer share a common goal: sharing resources [Foster 03, Goux 00] Grid related work [Globus]: providing computational resources; (-) installing/deploying Grid middleware is difficult P2P related work [Gnutella, Freenet, DHT]: (-) focusing on sharing data & mono-application; dynamic & easy to deploy Objective: provide a P2P infrastructure for Grids and for sharing computational resources 30
Which P2P Architecture? Master-Worker (SETI@home): centralized; (-) targets only desktops; good for embarrassingly parallel applications Pure/Unstructured Peer-to-Peer (Gnutella): (-) flooding problems; supports high churn (good for Desktop Grids); supports many kinds of applications (data, computational) Hybrid Peer-to-Peer (JXTA): uses central servers; limits the flooding; (-) has to manage churn Structured Peer-to-Peer (Chord): (-) high cost for managing churn; efficient for data sharing (Distributed Hash Table) [Figures: Master-Worker task distribution with pending tasks and partial results; flooding between acquaintances in Unstructured P2P; Hybrid P2P with a central server; Chord ring lookup with a finger table] Pure Peer-to-Peer is the best suited; it needs to avoid the flooding problem 31
Contribution & Positioning [Figure: Grid layers, top to bottom, with the contributions placed in them] Grid Applications & Portals; Grid Programming: Branch-and-Bound API; Grid Middleware Infrastructure: P2P Infrastructure; Grid Fabric 32
The Peer-to-Peer Infrastructure [CMST06] Pure Peer-to-Peer overlay network, used as Grid middleware infrastructure Proposed solution to avoid the flooding problem: 3 node-request protocols: 1 node: Random walk algorithm; n nodes: Breadth-First-Search (BFS) algorithm with acknowledgement; max nodes: BFS without acknowledgement Best-effort 33
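The three node-request protocols listed on this slide can be sketched on a toy overlay. This is an illustrative sketch under stated assumptions, not the infrastructure's actual implementation: the overlay is a plain dictionary of acquaintances, the `ttl` parameter, the `free` flags, and all function names are hypothetical, and the acknowledgement exchange is left out.

```python
import random
from collections import deque


def random_walk_request(overlay, free, start, ttl, rng):
    """Ask for 1 node: forward the request to one random acquaintance per hop."""
    current = start
    for _ in range(ttl):
        neighbours = overlay[current]
        if not neighbours:
            return None
        current = rng.choice(neighbours)
        if free.get(current, False):
            return current
    return None


def bfs_request(overlay, free, start, ttl, wanted=None):
    """Ask for n nodes (wanted=n) or for as many nodes as possible (wanted=None).

    The request spreads breadth-first up to `ttl` hops. Each peer handles
    the request only once (the `visited` set), which bounds the flooding.
    """
    found = []
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        peer, depth = queue.popleft()
        if depth == ttl:
            continue  # time-to-live exhausted: do not forward further
        for neighbour in overlay[peer]:
            if neighbour in visited:
                continue
            visited.add(neighbour)
            if free.get(neighbour, False):
                found.append(neighbour)
                if wanted is not None and len(found) == wanted:
                    return found
            queue.append((neighbour, depth + 1))
    return found


if __name__ == "__main__":
    # Hypothetical overlay: peer -> list of acquaintances.
    overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
               "D": ["B", "C", "E"], "E": ["D"]}
    free = {"B": False, "C": True, "D": True, "E": True}
    print(random_walk_request(overlay, free, "A", ttl=10, rng=random.Random(0)))
    print(bfs_request(overlay, free, "A", ttl=2, wanted=2))
    print(bfs_request(overlay, free, "A", ttl=3))
```

The random walk sends a single message per hop, so it is cheap when only one node is needed; the breadth-first variants reach more peers per hop and rely on the hop limit and on duplicate suppression to keep the traffic bounded.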
INRIA Sophia P2P Desktop Grid Need to validate the infrastructure with desktops: 260 desktops at INRIA Sophia lab Not disturbing normal users: running in low priority; working schedules: 24/24: 50 machines (INRIA-2424); night/weekend: 260 machines (INRIA-ALL) [Figure: INRIA Sophia Antipolis desktop machines network: INRIA sub-networks with INRIA-2424 and INRIA-ALL desktop machines linked by acquaintances] Deployed our P2P infrastructure as a permanent Desktop Grid at INRIA Sophia 34
Long Running Experiments with the P2P Desktop Grid Context: ETSI Grid Plugtests contest (n-queens) n-queens: embarrassingly parallel / CPU intensive / master-worker How many solutions for 25 queens? 35
n-queens Experiment Results
Total # of Tasks: 12,125,199
Task Computation: 138''
Computation Time: 185 days
Cumulated Time: 53 years
# of Desktop Machines: 260
Total of Solutions Found: 2,207,893,435,808,352 (2 quadrillion)
World Record [Sloane Integer Sequence A000170]
What we learned from this experiment: validated the workability of the infrastructure; validated the robustness of the infrastructure; hard to forecast machines' performance 36
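The slides do not detail how the n-queens computation was cut into tasks. As an illustration of why the problem is embarrassingly parallel, here is a minimal sketch that assumes tasks are generated by fixing the queens of the first rows; each task can then be counted independently and the results summed. The function names and the `depth` parameter are hypothetical.

```python
def make_tasks(n, depth):
    """Enumerate all valid placements of the first `depth` rows.

    Each task is a bitmask triple (columns, left diagonals, right diagonals)
    describing the squares attacked on the next row.
    """
    full = (1 << n) - 1
    tasks = [(0, 0, 0)]
    for _ in range(depth):
        next_tasks = []
        for cols, ld, rd in tasks:
            free = ~(cols | ld | rd) & full
            while free:
                bit = free & -free  # lowest free column
                free ^= bit
                next_tasks.append((cols | bit, (ld | bit) << 1, (rd | bit) >> 1))
        tasks = next_tasks
    return tasks


def solve_task(n, task):
    """Count the complete solutions that extend one partial placement."""
    full = (1 << n) - 1
    cols, ld, rd = task
    if cols == full:
        return 1
    total = 0
    free = ~(cols | ld | rd) & full
    while free:
        bit = free & -free
        free ^= bit
        total += solve_task(n, (cols | bit, (ld | bit) << 1, (rd | bit) >> 1))
    return total


def count_queens(n, depth=2):
    """Master side: split into independent tasks, then sum the partial counts."""
    return sum(solve_task(n, task) for task in make_tasks(n, depth))


if __name__ == "__main__":
    print(len(make_tasks(8, 2)), "tasks for n=8")
    print(count_queens(8), "solutions for n=8")
```

Because the tasks share no state, workers never need to communicate with each other; only the partial counts travel back to the master.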
Agenda Context, Problem, and Related Work Contributions Branch-and-Bound Framework for Grids Desktop Grid with Peer-to-Peer Mixing Desktops & Clusters Perspectives & Conclusion 37
Mixing Desktops & Clusters [PARCO07] Grid'BnB (Parallel BnB Framework for Grid) + P2P Infrastructure as Grid Middleware: API for solving COPs & Dynamic Grid Infrastructure Validate by experiments on a Grid of Desktops & Clusters New Problems: firewalls (forwarder); sharing clusters with the P2P infrastructure 38
Sharing Cluster's Nodes with P2P [Figures: a peer sharing its local node; a peer sharing remote nodes located on several hosts] 39
Testbed INRIA Sophia - Desktops; G5K Platform: Orsay GDX, Lyon, Sophia, Toulouse [Figure: peers and the nodes they share on each site, connected through RMI/SSH] 40
Large-Scale Experiments Goal: validate the infrastructure by experiments With n-queens: no communication between workers test the workability of the infrastructure With flow-shop: communications between workers test solving COPs 41
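To illustrate why the flow-shop runs involve communication between workers, the following sequential sketch solves a tiny permutation flow-shop instance with Branch-and-Bound. It is not Grid'BnB code: the instance, the simple last-machine lower bound, and all names are assumptions made for illustration. The `best` dictionary stands in for the best upper bound that parallel workers would share with each other.

```python
from itertools import permutations


def advance(completion, job_times):
    """Completion time on each machine after appending one job."""
    new, prev = [], 0
    for m, p in enumerate(job_times):
        prev = max(prev, completion[m]) + p
        new.append(prev)
    return new


def makespan(times, order):
    """Makespan of a complete job order."""
    completion = [0] * len(times[0])
    for j in order:
        completion = advance(completion, times[j])
    return completion[-1]


def branch_and_bound(times):
    """Return (best makespan, best order, number of explored nodes)."""
    best = {"makespan": float("inf"), "order": None}  # shared bound stand-in
    explored = 0

    def explore(order, remaining, completion):
        nonlocal explored
        explored += 1
        if not remaining:
            if completion[-1] < best["makespan"]:
                best["makespan"], best["order"] = completion[-1], order
            return
        # Lower bound: the last machine still has to run every remaining job.
        lower = completion[-1] + sum(times[j][-1] for j in remaining)
        if lower >= best["makespan"]:
            return  # prune: this subtree cannot improve the best solution
        for j in sorted(remaining):
            explore(order + [j], remaining - {j}, advance(completion, times[j]))

    explore([], frozenset(range(len(times))), [0] * len(times[0]))
    return best["makespan"], best["order"], explored


if __name__ == "__main__":
    # Hypothetical instance: times[job][machine].
    times = [[3, 2, 4], [2, 5, 1], [4, 1, 3], [1, 3, 2]]
    value, order, explored = branch_and_bound(times)
    brute = min(makespan(times, p) for p in permutations(range(len(times))))
    print(value, order, explored, brute)
```

The number of explored nodes depends on how early a good solution is found, since a tighter bound prunes more subtrees. In a parallel run this is why workers broadcast improved bounds, and why the explored part of the tree, and therefore the speedup, can vary from one run to another.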
N-Queens Results: NQueens with n=22
Total # of CPUs    Time in minutes
1007               24.49
859                39.01
572                63.68
380                190.63
250                238.68
[Figure: for each experiment, the total number of CPUs broken down by site: INRIA-ALL, G5K Sophia, G5K Orsay, G5K Lyon, G5K Bordeaux, G5K Toulouse] 42
Flow-Shop Results: Flow-Shop 17 jobs / 17 machines [Figure: time in minutes and total # of CPUs for experiments with 628, 346, 321, 313, 220, 201, and 80 CPUs, broken down by site: INRIA-2424, G5K Sophia, G5K Orsay, G5K Rennes, G5K Lyon, G5K Bordeaux, G5K Nancy, G5K Toulouse; execution times range from 56.03 to 125.55 minutes] With desktops & clusters, speedup anomalies occur more often 43
Mixing - Analysis N-Queens problem scales well up to 1007 CPUs: 349 Desktops + 5 Clusters Flow-Shop up to 628 CPUs: 42 Desktops + 5 Clusters; worse performance than using only clusters (anomalies are more frequent) Experimented in closed environments -- security Grid'5000 platform's success: hard for running long experiments 44
P2P as a Meta-Grid infrastructure Observation from Grid'5000: 300 nodes available for 2 minutes; provides a best-effort queue Idea: take advantage of these nodes [Condor]; by hand: not easy; with the P2P infrastructure: deploying peers in the best-effort queue Result: a permanent P2P infrastructure over Grid'5000 P2P infrastructure as a dynamic Grid middleware 45
P2P with N-Queens on Grid'5000 [Figure: experimentation n22.log.40; # of workers per minute (# of CPUs, up to 1400) and tasks computed per minute (# of Tasks, up to 1800) versus experimentation time in minutes (0 to 40)] 1384 CPUs - 9 sites Embarrassingly parallel applications can benefit from the Meta-Grid Infrastructure 46
Agenda Context, Problem, and Related Work Contributions Branch-and-Bound Framework for Grids Desktop Grid with Peer-to-Peer Mixing Desktops & Clusters Perspectives & Conclusion 47
Summary Grid'BnB: a BnB framework for Grids communication between workers organizing workers in groups Grid infrastructure based on P2P architecture mixing desktops and clusters deployed at INRIA Sophia lab 48
Perspectives Peer-to-Peer Infrastructure: Job Scheduler; Resource Localization (PhD) Large-Scale Experiments: International Grid: France, Japan, and Netherlands; Grid Plugtests Deployment: Contracts in Grids (GCM Standard) Industrialization: P2P Desktop Resource Virtualization; CPER - P2P 1M / 4 years to professionalize the INRIA Grid 49
Conclusion Branch-and-Bound for Grids solve COPs hide Grid difficulties communication between workers Peer-to-Peer as Grid infrastructure mixing desktops and clusters Tested and experimented Available in open source Provides framework and infrastructure to hide Grid difficulties 50
[SCCC05] Balancing Active Objects on a Peer to Peer Infrastructure. Javier Bustos-Jimenez, Denis Caromel, Alexandre di Costanzo, Mario Leyton, and Jose M. Piquer. Proceedings of the XXV International Conference of the Chilean Computer Science Society (SCCC 2005), Valdivia, Chile, November 2005.
[HIC06] Executing Hydrodynamic Simulation on Desktop Grid with ObjectWeb ProActive. Denis Caromel, Vincent Cavé, Alexandre di Costanzo, Céline Brignolles, Bruno Grawitz, and Yann Viala. HIC2006: Proceedings of the 7th International Conference on HydroInformatics, Nice, France, September 2006.
[HiPC07] Grid'BnB: A Parallel Branch & Bound Framework for Grids. Denis Caromel, Alexandre di Costanzo, Laurent Baduel, and Satoshi Matsuoka. HiPC 07, Goa, India, December 2007.
[CMST06] ProActive: an Integrated platform for programming and running applications on grids and P2P systems. Denis Caromel, Christian Delbé, Alexandre di Costanzo, and Mario Leyton. Journal on Computational Methods in Science and Technology, volume 12, 2006.
[PARCO07] Peer-to-Peer for Computational Grids: Mixing Clusters and Desktop Machines. Denis Caromel, Alexandre di Costanzo, and Clément Mathieu. Parallel Computing Journal on Large Scale Grid, 2007.
[FGCS07] Peer-to-Peer and Fault-tolerance: Towards Deployment-based Technical Services. Denis Caromel, Alexandre di Costanzo, and Christian Delbé. Journal Future Generation Computer Systems, to appear, 2007.
and Workshops, Master report,...