Parallel Branch and Cut for Dummies


Parallel Branch and Cut for Dummies
Ted Ralphs
Department of Industrial and Manufacturing Systems Engineering, Lehigh University, Bethlehem, PA

Outline of Talk
- Introduction to Parallel Systems
- Introduction to Tree Search
- Parallelizing Branch and Bound
- Parallelizing Branch and Cut
- The SYMPHONY Library
- Computational Results
- Conclusions

Parallel Systems
Parallel system: parallel algorithm + parallel architecture [Kumar and Gupta 94].
Scalability: how well a parallel system takes advantage of increased computing resources.
- Fixed problem size: efficiency decreases as processors are added (Amdahl's Law) [Amdahl 67].
- Fixed number of processors: efficiency increases with problem size.
- Isoefficiency: the rate at which problem size must be increased to maintain a fixed efficiency [Kumar and Rao 87].
Terms:
- Sequential runtime: T_s
- Sequential fraction: s
- Parallel runtime (on N processors): T_p
- Parallel overhead: T_o = N*T_p - T_s
- Speedup: S = T_s / T_p
- Efficiency: E = S / N
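As a concrete illustration of these definitions, the short Python sketch below computes speedup, efficiency, and parallel overhead from a sequential runtime, a parallel runtime, and a processor count. The function name and the example numbers are purely illustrative, not part of SYMPHONY.

```python
def parallel_metrics(t_seq, t_par, n_procs):
    """Compute the basic parallel performance metrics defined above.

    t_seq   -- sequential runtime T_s
    t_par   -- parallel runtime T_p on n_procs processors
    n_procs -- number of processors N
    """
    speedup = t_seq / t_par                 # S = T_s / T_p
    efficiency = speedup / n_procs          # E = S / N
    overhead = n_procs * t_par - t_seq      # T_o = N * T_p - T_s
    return speedup, efficiency, overhead

# Example (made-up numbers): 100 s sequentially, 20 s on 8 processors.
s, e, o = parallel_metrics(100.0, 20.0, 8)
print(f"speedup = {s:.2f}, efficiency = {e:.2f}, overhead = {o:.1f}")
```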

Tree Search
Types of tree search problems:
- Simple search
- Complete search
- Optimal search (DOPs)
Application areas:
- Discrete optimization
- Artificial intelligence
- Game playing
- Theorem proving
- Expert systems
Elements of search algorithms:
- Node splitting method (branching)
- Search order
- Pruning rules

Scalability Factors for Tree Search
General algorithmic factors:
- Serial fraction
- Degree of concurrency
- Parallel overhead
- Unnecessary/duplicate work
- Bottlenecks
Factors specific to search algorithms:
- The ratio of U_calc to U_comm
- Overall size of the search tree
- Number of branches/children per node
- Time to ramp up
- Work distribution scheme: single work pool vs. multiple work pools

Parallel Branch and Bound
Finding a feasible solution quickly is important.
Diving strategies:
- Lower memory and communication requirements
- Less node set-up time
- Find feasible solutions quickly
- Wasted computation
Best-first strategies:
- High memory and communication requirements
- Additional node set-up time
- Inhibit finding feasible solutions
- Minimize the size of the tree
Hybrid strategies (see the sketch below):
- Controlled diving (cost vs. fractional)
- Heuristics (initial and primal)
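To make the hybrid idea concrete, here is a minimal Python sketch of a controlled-diving node-selection rule: keep processing a child of the node just branched as long as its bound stays within a tolerance of the best bound in the pool, and otherwise fall back to best-first selection. The data structures, the tolerance, and the rule itself are assumptions for illustration, not SYMPHONY's actual implementation.

```python
import heapq
import itertools

_tie = itertools.count()   # tie-breaker so the heap never compares node objects

def push(pool, bound, node):
    """Add a node to the best-first pool (a heap ordered by bound)."""
    heapq.heappush(pool, (bound, next(_tie), node))

def select_next_node(pool, children, gap_tolerance=0.05):
    """Controlled diving: prefer a child of the node just processed
    (no set-up cost, finds feasible solutions quickly) as long as its
    bound is within gap_tolerance of the best bound in the pool;
    otherwise fall back to best-first selection.

    children -- list of (bound, node) pairs produced by the last branching;
                bounds are assumed positive (minimization).
    """
    best_in_pool = pool[0][0] if pool else float("inf")
    if children:
        child_bound, child = min(children, key=lambda c: c[0])
        if child_bound <= best_in_pool * (1.0 + gap_tolerance):
            for b, n in children:
                if n is not child:
                    push(pool, b, n)     # park the siblings in the pool
            return child                 # keep diving
        for b, n in children:
            push(pool, b, n)             # no child is promising enough
    return heapq.heappop(pool)[2] if pool else None
```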

Parallel Branch, Cut, and Price
Global pools are an important additional factor.
Node pools:
- Single pool: easy maintenance, load balancing, and termination detection; more accurate global picture; can become a bottleneck.
- Multiple pools: increased communication requirements; difficult to load balance; reduce saturation.
Cut and column pools:
- Must be scanned linearly (see the sketch below).
- Single pool: slow to scan when large; (possibly) better global information.
- Multiple pools: smaller and faster to scan; contain more localized information; more overhead.
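The remark that cut pools must be scanned linearly can be illustrated with a toy cut pool. The cut representation (coefficient dictionary and right-hand side) and the purging rule below are assumptions for the example, not SYMPHONY's internal format.

```python
class CutPool:
    """Toy cut pool: stores cuts of the form a*x >= b and returns those
    violated by a given LP solution.  Every query is a linear scan,
    which is why a single large pool can become slow."""

    def __init__(self, max_size=10000):
        self.cuts = []            # each entry is [coeffs, rhs, hit_count]
        self.max_size = max_size

    def add(self, coeffs, rhs):
        self.cuts.append([coeffs, rhs, 0])
        if len(self.cuts) > self.max_size:
            self.purge()

    def violated_cuts(self, x, tol=1e-6):
        """Linear scan: return all stored cuts violated by the point x.
        coeffs and x are dicts mapping variable index to value."""
        found = []
        for cut in self.cuts:
            coeffs, rhs, _ = cut
            lhs = sum(a * x.get(j, 0.0) for j, a in coeffs.items())
            if lhs < rhs - tol:
                cut[2] += 1       # count how often the cut was useful
                found.append((coeffs, rhs))
        return found

    def purge(self):
        """Keep only the cuts that have been violated most often."""
        self.cuts.sort(key=lambda c: c[2], reverse=True)
        del self.cuts[self.max_size // 2:]
```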

Performance Measures for Parallel Search
Measures of redundant/unnecessary work:
- Number of nodes in the search tree
- Average time to process a node
Measures of idle time:
- Time spent waiting for work
- Time spent waiting for cuts/columns from a pool
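One simple way to collect the idle-time measures above is to wrap every blocking wait in a timer. The sketch below is a generic instrumentation idea under assumed placeholder routines (receive_node, receive_cuts), not code from SYMPHONY.

```python
import time
from collections import defaultdict

idle_time = defaultdict(float)   # e.g. idle_time["waiting_for_work"]

class timed_wait:
    """Context manager that charges the time spent inside a blocking wait
    (for work, or for cuts/columns from a pool) to a named counter."""
    def __init__(self, reason):
        self.reason = reason
    def __enter__(self):
        self.start = time.perf_counter()
    def __exit__(self, *exc):
        idle_time[self.reason] += time.perf_counter() - self.start

# Usage inside a worker loop (receive_node / receive_cuts are placeholders):
# with timed_wait("waiting_for_work"):
#     node = receive_node()
# with timed_wait("waiting_for_cuts"):
#     cuts = receive_cuts()
```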

SYMPHONY
SYMPHONY is a generic framework for implementing parallel branch, cut, and price. It was designed specifically to run in parallel.
- Can be built for serial, distributed-memory, or shared-memory environments.
- No knowledge of parallelism is needed.
- Runs on multiple platforms, even mixed ones.
The user supplies:
- separation subroutines,
- the initial LP relaxation,
- a feasibility checker, and
- other optional subroutines.
SYMPHONY takes care of all other aspects of algorithm execution, including communication.
Available at www.branchandcut.org.
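Schematically, the division of labor looks like the following. SYMPHONY's real interface is a set of C callbacks, so the Python names here (MyProblem, initial_relaxation, separate, is_feasible, and the commented-out Framework driver) are purely illustrative assumptions, not SYMPHONY's API.

```python
class MyProblem:
    """User-supplied pieces: an initial LP relaxation, a separation routine,
    and a feasibility checker.  Everything else (search, communication,
    load balancing) is handled by the framework."""

    def initial_relaxation(self):
        # return the objective, constraint matrix, and bounds of the root LP
        ...

    def separate(self, lp_solution):
        # return a list of valid inequalities violated by lp_solution
        ...

    def is_feasible(self, lp_solution):
        # return True if lp_solution is feasible for the original problem
        ...

# framework = Framework(MyProblem(), n_processes=8)   # hypothetical driver
# best_solution = framework.solve()
```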

The Processes of Parallel Branch and Cut
[Diagram: the process architecture and its message flows.]
- Master: stores problem data, services requests for data, computes an initial upper bound, stores the best solution, handles I/O.
- Tree Manager: maintains the search tree, tracks the upper bound, services requests for active node data.
- LP processes: process subproblems, perform branching, check feasibility, send cuts to the cut pool (each uses an LP solver).
- Cut Generator: generates cuts violated by a particular LP solution.
- Cut Pool: maintains a list of "effective" inequalities and returns all cuts violated by a particular LP solution.
- GUI: displays solutions, accepts user cuts.
Messages exchanged include parameters, the root node, node data, upper bounds, feasible solutions, LP solutions, cut lists, new and violated cuts, and "subtree is finished" notifications.

Branching
Branching decisions are critical, especially near the top of the tree.
We want to minimize ramp-up time while not increasing overall execution time (difficult!).
Can branch on cuts or variables:
- Any number of children is allowed.
- Can branch on several left-hand-side values of a constraint.
Strong branching is indispensable (see the sketch below):
- Select several branching candidates (can be cuts, variables, or both).
- Presolve each candidate.
- Choose the best for branching.
- More candidates are evaluated near the top of the tree.
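A minimal sketch of the strong-branching loop described above, assuming a caller-supplied presolve routine that returns the child bounds obtained by tentatively imposing a candidate branching disjunction. The names, the depth threshold, and the scoring rule are illustrative choices, not the talk's exact procedure.

```python
def strong_branch(candidates, presolve, depth, top_width=10, deep_width=4):
    """Pick a branching object (variable or cut) by presolving candidates.

    candidates -- iterable of branching objects (variables, cuts, ...)
    presolve   -- function mapping a candidate to (down_bound, up_bound),
                  the child bounds obtained from a few dual simplex pivots
    depth      -- depth of the current node; more candidates are evaluated
                  near the top of the tree, where decisions matter most
    """
    width = top_width if depth <= 5 else deep_width
    best, best_score = None, float("-inf")
    for cand in list(candidates)[:width]:
        down, up = presolve(cand)
        score = min(down, up)          # one simple scoring rule among many
        if score > best_score:
            best, best_score = cand, score
    return best
```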

Pool Management
Node storage:
- Candidate nodes are stored in a single pool.
- Try to minimize memory use while not increasing node communication time.
- The complete state of the tree is stored.
- Node descriptions are stored explicitly or relative to the parent (see the sketch below).
Cut storage:
- Cuts are stored in a single pool or multiple pools in a compact, efficient form.
- The cut pools are allocated dynamically and serve assigned subtrees.
- Pools are purged when they become too large.
Search management:
- The search algorithm is best-first search with controlled diving based on either cost or degree of fractionality.
- Best-first allows selection of the best node.
- Diving avoids node set-up costs and allows feasible solutions to be found quickly.
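Storing node descriptions relative to the parent can be illustrated with a simple differencing scheme: each node keeps only the bound changes made by its own branching decision, and the full description is rebuilt by walking up to the root. This is a generic sketch, not SYMPHONY's storage format.

```python
class Node:
    """Search-tree node that stores only its *difference* from its parent:
    the variable bounds changed by the branching decision that created it."""

    def __init__(self, parent=None, bound_changes=None):
        self.parent = parent
        self.bound_changes = bound_changes or {}   # {var_name: (lb, ub)}

    def full_bounds(self, root_bounds):
        """Rebuild the complete bound set by walking back to the root."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.bound_changes)
            node = node.parent
        bounds = dict(root_bounds)
        for changes in reversed(chain):            # apply root-to-leaf
            bounds.update(changes)
        return bounds

# Example: branch on x3 at the root, then on x7 one level below.
root = Node()
left = Node(root, {"x3": (0, 0)})
leaf = Node(left, {"x7": (1, 1)})
print(leaf.full_bounds({"x3": (0, 1), "x7": (0, 1)}))
# {'x3': (0, 0), 'x7': (1, 1)}
```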

Summary Results: Vehicle Routing Problem
Tests were performed on a network of 3 workstations, each with 4 DEC Alpha processors. The problems are easy- to medium-difficulty instances from VRPLIB and other sources. Idle time is the total time spent waiting for work. Cut generation was not parallelized, and no initial upper bound was provided for these runs. Software was SYMPHONY 2.7b and CPLEX 6.5. (WC = wall clock.)

Number of LP processes      1        2        4        8
# of nodes               6593     6691     6504     6423
WC time                  2493     1281      666      404
WC/node                  0.38     0.38     0.41     0.50
Idle time               12.68    66.97   208.34   497.45
Idle/node                0.00     0.01     0.03     0.08
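Using the speedup and efficiency definitions from earlier, the wall-clock times in this table work out as follows. This is a quick sanity check that treats the 1-process run as the sequential baseline (an approximation, since a 1-process parallel run is not exactly a sequential run).

```python
# Wall-clock times from the VRP table, indexed by number of LP processes.
wc = {1: 2493, 2: 1281, 4: 666, 8: 404}

t_seq = wc[1]                        # 1-process run as the baseline
for n, t in sorted(wc.items()):
    speedup = t_seq / t              # S = T_s / T_p
    efficiency = speedup / n         # E = S / N
    print(f"N={n}: speedup={speedup:.2f}, efficiency={efficiency:.2f}")
# For N=8: speedup = 2493/404 ~ 6.17, efficiency ~ 0.77
```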

Summary Results: Traveling Salesman Problem
Tests were performed on a network of 3 workstations, each with 4 DEC Alpha processors. The problems are easy- to medium-difficulty instances from TSPLIB. Idle time is the total time spent waiting for work. Cut generation was not parallelized, and no initial upper bound was provided for these runs. Software was SYMPHONY 2.7b and CPLEX 6.5.

Number of LP processes      1        2        4        8
# of nodes               3405     3041     2917     2901
WC time                 18119    10057     7445     5199
WC/node                  5.32     6.60    10.21    14.34
Idle time               24.34   935.42  3527.61 10767.89
Idle/node                0.01     0.31     1.21     3.71

Another Experiment
Comparison of full diving to controlled diving.

Full diving:
Number of LP processes      1        2        4
# of nodes               5197     4224     4091
WC time                 23695    12558     6912
WC/node                  4.56     5.95     6.76

Controlled diving:
Number of LP processes      1        2        4
# of nodes               3405     3041     2917
WC time                 18119    10057     7445
WC/node                  5.32     6.60    10.21

Further Experiments
Diving strategy (with vs. without an initial upper bound):
                  VRP                TSP
              Nodes    Time      Nodes    Time
w/ UB          3524     241       2720    6585
w/o UB         6504     666       2917    7444

Parallel overhead:
              Nodes     Time   Time/Node
Parallel       7014    24558        3.50
Sequential     7014    25446        3.63

Shared vs. distributed memory:
              Nodes     Time   Time/Node
Shared         8896     2975        1.34
Distributed    9701     3392        1.40

Conclusions and Future Work
- Architecture does not play a very big role.
- Ramp-up time is not a problem.
- Unnecessary/duplicate work is not an issue.
- The cut pools are not significant bottlenecks.
- The single biggest factor in loss of speedup is the bottleneck created by the tree manager.
- Multi-pool approaches are well studied but tough to implement (see [Eckstein 94], [Gendron and Crainic 94]).
- A single tree manager with several slave node pools might be a good compromise (à la [Eckstein 94]).
- A more decentralized approach may also be undertaken.