PARALLEL ALGORITHMS FOR PREDICTIVE MODELLING

MARKUS HEGLAND

Abstract. Parallel computing enables the analysis of very large data sets using large collections of flexible models with many variables. The computational methods are based on ideas from computational linear algebra and can draw on the extensive research on parallel algorithms in this area. Many algorithms for the direct and iterative solution of penalised least squares problems and for updating can be applied. Methods for both dense and sparse problems are applicable. An important property of the algorithms is their scalability, i.e., their ability to solve larger problems in the same time using hardware which grows linearly with the problem size. While in most cases large-granularity parallelism is to be preferred, it turns out that even smaller-granularity parallelism can be exploited effectively in the problems considered. The development is illustrated by four examples of nonparametric regression techniques. In the first example, additive models are considered. While the backfitting method contains dependencies which inhibit parallel execution, it turns out that parallelisation over the data leads to a viable method, akin to the bagging algorithm without replacement, which is known to have superior statistical properties in many cases. The second example considers radial basis function fitting with thin plate splines. Here the direct approach turns out to be non-scalable, but an approximation with finite elements is shown to be scalable and to parallelise well. One of the most popular algorithms in data mining is MARS (Multivariate Adaptive Regression Splines), which is discussed in the third example. MARS has been modified to use a multiscale approach, and a parallel algorithm with a small granularity has been seen to give good results. The final example considers the current research area of sparse grids. Sparse grids take up many ideas from the previous examples and, in fact, can be considered as a generalisation of MARS and additive models. They are naturally parallel when the combination technique is used. We discuss limitations and improvements of the combination technique.

1. Introduction

1.1. Data, Functions and Computers. The availability of increasing computational power has led to new opportunities in data processing. This includes the computational ability to handle larger and growing data sets and to fit increasingly complex models. In this review we aim to illustrate this development by discussing several recently developed methods and algorithms. These developments have resulted from joint and independent efforts of several communities. These include statisticians with an interest in computational performance, numerical linear algebraists interested in data processing, the more recently emerging data mining community with interests in algorithms, the computational mathematics community interested in applications of approximation theory, the diverse parallel computing community, machine learners with a computational orientation interested in very large scale problems, and many computational scientists, especially the ones interested in the computational solution of partial differential equations.

Many other communities which I have not mentioned here have contributed to this exciting development as well. I hope the following discussion will be of interest to members of many of these communities. The computational techniques discussed in the following have originated at various places, and the parallel algorithms covered have all been the topic of extensive research conducted at the Australian National University.

For anyone interested in trying out these methods there are many data sets available which allow investigating the performance of computational techniques [7, 21, 44]. While some of these data sets are not large by today's standards they are still useful for scalability studies. Software is available for many of the methods discussed, and while well established companies have constantly been able to improve their standard software, maybe some of the most exciting developments in the area of algorithms and large scale high performance computing software have been happening in the open source community, driven by publicly funded university research.

This review does not cover statistical issues. Even though statistical issues are extremely important, they are covered in many excellent statistical texts, including the recently published [35]. Here we will focus on computational aspects and the algorithms, their computational capability to handle very large data sets and complex models, and the way in which they harness the performance of multiprocessor systems.

Predictive modelling aims at finding functional patterns in data. More specifically, given two sequences $x^{(1)},\dots,x^{(N)}$ and $y^{(1)},\dots,y^{(N)}$ defining the data, predictive modelling aims to find functions $f$ such that $y \approx f(x)$, i.e., $y^{(i)} \approx f(x^{(i)})$. While the type of the predictor $x$ can be fairly arbitrary, including real and binary vectors, text, images, and time series, we will focus in the following on the case where $x$ is an array containing categorical (discrete) and real (continuous) components. The response $y$ will be considered to be real, i.e., we will be discussing algorithms for regression. Many of the observations and algorithms can be generalised to other data types for both the predictors and the response. In particular, the algorithms discussed here are also used for classification, see, e.g., [35].

We will be concerned here with the time it takes to solve a particular regression problem. The time depends firstly on the data size, modelled by $N$ above. The data available in data analysis and, in fact, the data which is actually used in individual studies has been growing exponentially. It has been suggested that this growth is akin to the growth in computational performance (according to Moore's law). GByte sized data sets are very common, TByte data sets are used in large studies and PByte data sets are emerging. Software and algorithm development, however, is done at a slower pace. This discrepancy between the speed of software development and the simultaneous growth of data sizes and computational power is not a problem if the software is able to adapt to new data sizes and hardware. This adaptability is referred to as scalability. Fundamentally, scalability of algorithms and software means that the same algorithm can be used to deal with larger data sets if the computational power is increased accordingly.

There are two fundamentally different ways to increase computational throughput. One way is to increase the speed (clock cycle, memory and communication bandwidth, latencies) of the hardware and the other way is to use multiple (often identical) hardware components which work simultaneously. In order to make full use of this hardware parallelism one needs special algorithms, and this is the focus of the following discussion. Ideally, the throughput of any computation grows linearly with the amount of hardware available.

The amount of disk space required grows linearly with the data size and the work required to read the data grows linearly as well. Thus the computational resources needed to analyse a data set have to grow at least linearly in the data size $N$. Consequently $O(N)$ is the optimal resource complexity and we will in the following assume that all the hardware resources available grow linearly with the data size. The $O(N)$ hardware complexity has important implications for the feasible computational complexity of the algorithms. We can assume that the computations required are close to the limits of what is feasible. One would typically not allow for an increase in computational time even when the data sets grow. However, if we allow the number of processing units to grow linearly in the size of the data then, in the ideal case, the time to process the data stays constant if the algorithms used have $O(N)$ complexity. Again this computational complexity is optimal as one at least needs to scan all the data once. This linear complexity of the algorithm is called scalability with respect to the data size.

Besides the data size another determining factor for performance is model complexity. This complexity relates to the number of components of the predictor variable vector $x$ and the dimension of the function space. While earlier parametric linear models had just a handful of variables and parameters to determine, modern nonparametric models and, in particular, data mining applications consider thousands up to millions of parameters. Finally, the function is usually drawn from a large collection of possible candidate functions. While earlier one was interested in just the best of all models, in data mining applications one would like to find all models which are interesting and supported by the data. Thus the size of the candidate set is a third determining factor of performance.

The algorithms considered here are characterised by the function class. A simple way to generate spaces of functions with many variables is to start with spaces of functions of one variable and consider the tensor product of these spaces. In terms of the basis functions, for example, if $V_i$ are spaces of functions of the form $f(x_i) = \sum_j u_{i,j} b_{i,j}(x_i)$ then the tensor product $\bigotimes_{i=1}^d V_i$ consists of functions of the form
\[
  f(x) = \sum_{j_1,\dots,j_d} u_{j_1,\dots,j_d}\, b_{1,j_1}(x_1)\cdots b_{d,j_d}(x_d).
\]
If (for simplicity) we assume that each component space $V_i$ has $m$ basis functions $b_{i,j}(x_i)$ then the tensor product space has $m^d$ basis functions. As the basis is local we expect that this expansion will contain essentially all basis functions for most functions of interest. For example, if tensor product hat functions are used as a basis, then one has for the constant function $f(x) = 1$ the expansion
\[
  1 = \sum_{j_1,\dots,j_d} b_{1,j_1}(x_1)\cdots b_{d,j_d}(x_d).
\]

This problem has been called the curse of dimensionality and it occurs in many other situations as well. In the following we will consider function classes which, while often subsets of the tensor product space, use a different basis and are able to deal effectively with the curse of dimensionality.

In order to find interesting functions one needs to measure success, using a loss function, error criterion or cost function. This cost function also has a tremendous influence on the development of the algorithm and on the performance. For regression problems the simplest loss function used is the sum of squared residuals
\[
  \sum_{i=1}^{N} \bigl(f(x^{(i)}) - y^{(i)}\bigr)^2.
\]
This is the cost function we will mainly use here. Many other measures are used, including the penalised sum of squares and the median absolute deviation. As always, one should be very careful when interpreting this function, in particular if the function has been determined by minimising this sum. The sum of squares, or so-called resubstitution error, is in this case a very poor measure for the actual error of the fitted function as it will seriously underestimate the actual error. More reliable measures have been developed, including hold-out data sets, cross-validation and the bootstrap. The data is often thought of as drawn from an underlying theoretical data distribution. If we denote the expectation with respect to this distribution by $E(\cdot)$, the expected mean squared residual is
\[
  E\bigl((f(X) - Y)^2\bigr)
\]
where $X$ and $Y$ are the random variables observed in the data. The best function $f$ would be the one which minimises this expectation. This theoretical best function is given by a conditional expectation:
\[
  f(x) = E(Y \mid X = x).
\]
The algorithms we discuss here all aim to approximate this best function.

Parallelism is a major contributor to computational performance. Parallelism is exploited by splitting the instruction and data streams into several independent branches. Hardware supporting this parallelism ranges from different stages of arithmetic pipelines, multiple memory paths and different CPUs with a shared memory to multiple independent processors each with their own memory, collections of workstations and grids of supercomputers. In general, any computation will involve parallelism at many different levels. Slightly simplified, one has $p$ streams of operations which run in parallel instead of having to run sequentially. A fundamental characteristic of parallel execution is the parallel speedup, defined as the ratio of the time for sequential execution to that for parallel execution,
\[
  S_p = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}.
\]
Ideally, if one has $p$ streams running in parallel the speedup is $p$. Thus one introduces the parallel efficiency as the ratio of the actual speedup to the number of streams or parallel processes:
\[
  E_p = \frac{S_p}{p}.
\]
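As a concrete illustration of these two definitions, the following minimal Python helper evaluates them for made-up timings (the numbers are illustrative only, not measurements from any of the experiments discussed here):

    def speedup(t_sequential, t_parallel):
        # S_p = T_sequential / T_parallel
        return t_sequential / t_parallel

    def efficiency(t_sequential, t_parallel, p):
        # E_p = S_p / p, the fraction of ideal linear speedup that is attained
        return speedup(t_sequential, t_parallel) / p

    # Example: a job taking 120 s sequentially and 20 s on 8 processors
    print(speedup(120.0, 20.0))        # 6.0
    print(efficiency(120.0, 20.0, 8))  # 0.75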

Of course one cannot do everything in parallel; some tasks require that other ones are completed first. This dependency of different tasks is often present in iterative methods which determine the next step based on previous ones. In simple cases this dependency can be overcome. For example the determination of a sum $\sum_{i=1}^{N} x^{(i)}$ can either be done with a recursion
\[
  s_0 = 0, \qquad s_{i+1} = s_i + x^{(i)}
\]
or by first determining the sums of pairs of consecutive data points to get a new sequence and then repeating this until finished:
\[
  z_{i-1,0} = x^{(i)}, \qquad z_{i,k+1} = z_{2i,k} + z_{2i+1,k}.
\]
This recursive doubling algorithm increases the amount of parallelism considerably, especially for large data sizes $N$. The number of operations has not increased, but the amount of parallelism is still limited. Rather than completing in a time of approximately $(N-1)/p\,\tau$ this method would require
\[
  T_{\text{parallel}} = \Bigl(\frac{N-p}{p} + \log_2(p)\Bigr)\tau
\]
where $\tau$ is the time for one addition. Thus the efficiency in this case is
\[
  E_p = \frac{1}{1 + \frac{p\log_2(p) - p + 1}{N-1}}.
\]
This situation is not uncommon in parallel processing. Another simple case is reflected in Amdahl's law, where only a proportion of the total computation can be done in parallel. If, say, $r$ is the ratio of work which can only be done sequentially then one gets at best an efficiency of
\[
  E_p = \frac{1}{1 + (p-1)r}.
\]
The speedup is limited to $1/r$ and so the efficiency drops to 0 for very large $p$. As every computation will have some proportion which cannot be shared among other processors, this is a fundamental limitation to the amount of speedup available and thus to the computational time. This may be overcome by considering larger problems which have a smaller proportion $r$ of sequential tasks.

A second example of a dependency which can be overcome is the solution of a linear recursion
\[
  z_0 = a, \qquad z_{i+1} = b_i z_i + c_i.
\]
This can be done by interpreting the recursion as a linear system of equations which can then be solved with a parallel algorithm (we will talk about this more later). However, this comes at a cost and it turns out that in this case the time required is roughly doubled. These extra operations which are required to make the algorithm parallel are called parallel redundancy. This situation is again very common, and an algorithm where the amount of work is increased by a factor of $r$ will at most be able to achieve a parallel efficiency of
\[
  E_p \le \frac{1}{r}.
\]
Note that while the efficiency is bounded, at least it does not go to zero as in the previous case. If enough hardware parallelism $p$ is available this may still be a very interesting option as there is no bound on the smallest amount of time required.
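To make the recursive doubling idea concrete, here is a small NumPy sketch (my own illustration, not code from the paper). It performs the pairwise summation rounds one after the other, so it only models the parallelism — each round consists of independent additions that could run simultaneously — and the second helper simply evaluates the Amdahl efficiency formula above.

    import numpy as np

    def recursive_doubling_sum(x):
        # Repeatedly add pairs of neighbours: z_{i,k+1} = z_{2i,k} + z_{2i+1,k}.
        # The same N - 1 additions as a plain loop, but arranged in about log2(N)
        # rounds of independent pairwise sums.
        z = np.asarray(x, dtype=float)
        while z.size > 1:
            if z.size % 2:                 # pad odd-length arrays with a zero
                z = np.append(z, 0.0)
            z = z[0::2] + z[1::2]          # one (potentially parallel) round
        return z[0]

    def amdahl_efficiency(r, p):
        # E_p = 1 / (1 + (p - 1) r) for a sequential fraction r
        return 1.0 / (1.0 + (p - 1) * r)

    print(recursive_doubling_sum(range(1, 101)))   # 5050.0
    print(amdahl_efficiency(0.05, 16))             # efficiency with 5% sequential work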

For both cases of overcoming dependencies, the order of the operations and even the nature of the operations were changed. While this would typically give the same results if exact arithmetic were used, mainly due to the associativity of addition, this is not the case for floating point arithmetic. A careful analysis of the parallel algorithms is required to guarantee that the results of the computations are still valid. This is true in particular for the second example, where the parallel algorithm can even break down! There are, however, often stable parallel versions of the algorithms available; see [43] for some examples which generalise the second example.

While in principle an efficiency higher than one is not possible, there are cases where the parallel execution can run more than $p$ times faster on $p$ processors than on one processor. This happens for example when the data does not fit in the memory of one processor but does fit in the memory of $p$ processors. In this case one requires the loading of the data from disk during the execution on one processor, say, in $p$ chunks. If one counts the time which it takes to load the data from disk into memory in the parallel execution as well, one can confirm that in this case $E_p \le 1$ as well. However, this example points to an important aspect of parallel processing: the increase in available storage space. Not only are the computational processing capabilities increased in a parallel processor but also the memory, caches and registers. This increase in storage can have a dramatic effect on processing times.

Maybe the main inhibitor of parallel efficiency is problem size. Obviously one needs at least $p$ tasks to get a speedup of $p$. However, smaller problem sizes do not require the application of parallel processing. Often one would like to consider various problems with different problem sizes and one is interested in maintaining parallel efficiency over all problems. The question is thus whether we can maintain parallel efficiency for larger numbers of processors if we at the same time increase the problem size. One chooses the problem size as a function of the amount of parallelism $p$. This is called parallel scalability. In the case of the determination of the sum of $N$ numbers one gets scalability if $N(p) = O(p \log_2(p))$, and many other cases are scalable. Note that the case of Amdahl's law does give a hard limit to scalability, so in this case no problem size will help to maintain efficiency, as the speedup is limited to $1/r$ if $r$ is the amount of sequential processing required.

When mapping a collection of tasks onto multiple resources a speedup of $p$ can only be obtained if all the tasks take the same time, as the time for the parallel execution is equal to the time required for the longest task. This is the problem of load-balancing the work. Thus one has to make sure that the sizes of the tasks are all about the same. One way to achieve this is to have very many tasks and none extremely large. The tasks can either be preallocated based on some estimate or, in many cases more efficiently, be allocated in real time using a master-worker model.

A typical parallel computing task will consist of periods where several independent subtasks are processed in parallel and other periods where information is exchanged between the processors. The average volume of work which is done between the communication periods determines the granularity of the task. In the case of small granularity the communication overhead or latency will strongly influence the time, and so one would aim to do larger tasks between data interchange steps.

In addition to the communication overhead one finds that some parts of the communication are independent of the granularity; thus again larger granularity shows better overall performance due to a reduction in communication volume. Finer granularity is used at the level of pipelines, vector instructions and SIMD (single instruction multiple data) processing, whereas larger granularity is usually preferred for collections of workstations and MIMD (multiple instruction multiple data) processing.

The effect of the hardware on the time to execute an algorithm is mainly determined by the access times of various storage devices, including registers, several levels of cache, memory, secondary storage, distributed storage in a cluster and distributed storage across the Internet. These memory hierarchies are defined by access speeds. The fastest computations are the ones which rely for their data movements almost entirely on the faster memories and only occasionally have to access the slower memories. The reason for the existence of these hierarchies is the enormous price difference between the various types of memories. The amount of slower memory available on any computing system will dwarf the amount of fast memory available. Any algorithm which requires frequent disk access will be much slower than a similar algorithm which mostly accesses registers.

An example of slow data access is the data transfer between different processors without shared memory. Here the programmer has to deal directly with the data movements. The transfer of data between different processors is called communication, and the most popular way to implement it is message passing. While message passing routines do include various complex patterns of data interchange, they all rely on the two primitives send, where one processor makes data available to another processor, and receive, where a processor accepts data from another processor. Thus any data transfer requires two procedure calls on two different processors. This provides a simple means of synchronisation. Software using message passing is even popular on shared memory processors due to its portability and the simple synchronisation mechanism. (Usually there is only a small time penalty through the application of message passing.) The time for message passing is modelled as proportional to the data size plus some latency:
\[
  t_{\text{communication}} = \tau + \beta n
\]
where $n$ is the amount of data transferred (typically in MByte), $\beta$ is the transfer time per unit of data (the reciprocal of the communication bandwidth) and $\tau$ the latency. As mentioned earlier, it is important to communicate data in large batches to avoid the penalty incurred through the latency. Another attractive feature of message passing is that standard message passing libraries are available, in particular MPI [31]; see the pypar package for a binding in Python.

Message passing is used for tasks which are distributed among processors in a multiprocessor system or a collection of workstations. In the case of a grid containing several multiprocessor systems even larger granularity is required. On this level data communication would be done using network data transfer commands (like scp and rsync) and the tasks should have very large granularity. Tools assisting with the access of computational grids can be found in the Globus toolkit. Running the jobs in batch mode on a collection of multiprocessors is appropriate for this large granularity, and open source implementations using an R language frontend are available.
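The send/receive primitives can be illustrated with the mpi4py binding (an assumption of this sketch; pypar offers analogous calls). This is a minimal example, not code from the paper:

    # Run with, e.g.: mpiexec -n 2 python sendrecv.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"block": 7, "rows": 100000}      # illustrative payload
        comm.send(data, dest=1, tag=11)          # "send": make data available to rank 1
    elif rank == 1:
        data = comm.recv(source=0, tag=11)       # "receive": accept data from rank 0
        print("rank 1 received", data)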

1.2. Concepts from Numerical Linear Algebra. We consider the problem of fitting data sets using functions from linear function spaces by (penalised) least squares methods. This problem has two computational components, the search for a good approximating subspace and the solution of a (penalised) least squares problem. We will review some of the developments and ideas from parallel numerical linear algebra which are relevant to the following discussion.

Numerical linear algebra is one of the main areas of parallel algorithm development. A culmination of this work was the release of the LAPACK and ScaLAPACK software [2, 9]. The solution techniques can be grouped into two major classes: iterative methods and direct methods. These methods are applied either to the residual equations $Ax - b = r$ or to the corresponding normal equations $A^T A x = A^T b$.

For the normal equations, iterative methods require the determination of repeated matrix vector products $y = Mx$ with the same matrix $M = A^T A$ but with varying $x$. In the case of dense $n \times n$ matrices $M$ this product requires $O(n^2)$ arithmetic operations. All the $n^2$ real multiplications required for this could be done in parallel in one step, but the additions require at least $\log_2(n) - 1$ steps of recursive doubling. Thus the matrix vector product could be done in $\log_2(n)$ steps if we neglect communication and have $n^2$ processors. However, in the cases considered here $n$ is large, so that it is unrealistic to assume the availability of $O(n^2)$ processors. For realistic parallel algorithms, the matrix is partitioned into blocks,
\[
  M = \begin{pmatrix} M_{1,1} & \cdots & M_{1,t} \\ \vdots & & \vdots \\ M_{s,1} & \cdots & M_{s,t} \end{pmatrix},
\]
and with a corresponding blocking of the vector $x$ the matrix vector product has the form
\[
  y_k = \sum_{j=1}^{t} M_{k,j} x_j \qquad \text{for } k = 1,\dots,s.
\]
So now one uses $st$ processors for the parallel multiplication step and successively fewer for the collective $\log_2(t)$ summation steps. Special variants which have been considered are the cases $t = s$, $t = 1$ and $s = 1$. The variants have slightly different characteristics. For example, in the case of $t = 1$ no reduction step and no loss of parallelism occurs. However, the whole vector $x$ needs to be broadcast at every step. There is no need for a broadcast in the case of $s = 1$, but the reduction operation is more dominant in this case. An analysis shows that the communication costs are similar for all cases. The choice of $st$ (which can be larger than $p$) influences the performance as well. See [45, 59] for a further discussion of these issues. The cases of $s = 1$ and $t = 1$ are especially of interest for sparse matrices as they provide better load balancing, due to the fact that these cases do not require an even distribution of the nonzero elements over the whole matrix but only over the rows or columns.
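A compact NumPy sketch of the blocked matrix vector product follows (sequential here; each of the $s\cdot t$ block products is the piece of work that would be given to one processor, and the inner sum is the reduction):

    import numpy as np

    def blocked_matvec(M, x, s, t):
        # y = M x via an s-by-t block partition of M.
        row_blocks = np.array_split(np.arange(M.shape[0]), s)
        col_blocks = np.array_split(np.arange(M.shape[1]), t)
        y = np.zeros(M.shape[0])
        for rows in row_blocks:
            partials = [M[np.ix_(rows, cols)] @ x[cols] for cols in col_blocks]
            y[rows] = np.sum(partials, axis=0)   # done in log2(t) steps with recursive doubling
        return y

    M = np.random.rand(6, 6)
    x = np.random.rand(6)
    assert np.allclose(blocked_matvec(M, x, s=2, t=3), M @ x)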

The effect of granularity has been considered in depth in the research on direct solvers leading to the ScaLAPACK and LAPACK libraries. The underlying algorithms, including Gaussian elimination, Cholesky factorisation and QR factorisation, can be implemented as a sequence of SAXPY operations which combine two vectors (e.g., rows of a matrix):
\[
  z = \alpha x + y,
\]
where $x$ and $y$ are vectors and $\alpha \in \mathbb{R}$. Here all the products and additions can be done in parallel and the average amount of parallelism is $O(n)$. This fine grain parallelism, even in the case of vector processing, suffers from the fact that there are 3 memory accesses for every 2 floating point operations. However, these operations have been fundamental in the further development; they form part of the level 1 BLAS subroutines.

Now one can group $n$ subsequent updates into one matrix update. This is the basis for matrix factorisations, and the update step is here a rank one modification of a matrix:
\[
  A = A + \alpha x y^T.
\]
Here the total amount of parallelism is the same as in the case of a matrix vector product, and it turns out that the performance and granularity are basically the same. This is the case of a level 2 BLAS routine and it allows efficient usage of vector registers. This is considered to be of intermediate granularity in numerical linear algebra. Note, however, that the amount of data to be accessed is proportional to the number of floating point operations performed ($O(n^2)$).

The highest granularity is obtained when matrix-matrix products are used. In the case of the solution of linear systems of equations this means that operations of the type
\[
  A = A + B C^{-1} D
\]
occur. Such operations allow the effective usage of various caches and efficient parallel execution. Here one sees the volume-to-surface effect of parallel linear algebra: the number of operations performed is $O(n^3)$ while the amount of data accessed is $O(n^2)$. This is one aspect of the large granularity of the computation. These computational building blocks have been optimised for many computational platforms and architectures, and implementations which adapt automatically to different architectures are available [61]. There are various approaches to the implementation on parallel architectures, see, e.g., [9, 56].

Granularity issues are important for iterative methods, and for direct methods using either the normal equations or the QR factorisation of the design matrix $A$. The approach using the normal equations has two computational steps:
(1) determine the normal matrix $M = A^T A$ and the vector $A^T b$;
(2) solve the normal equations $A^T A x = A^T b$.
In our case these two steps have totally different characteristics. In the first step the data is accessed. This step is scalable with respect to the data size and parallelises well if the data is distributed equally as
\[
  A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \end{pmatrix}.
\]
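The first, data-dependent step — forming the normal matrix from row-wise distributed data — can be sketched as follows (a sequential NumPy loop standing in for the $p$ processors; an illustration of mine, not the implementation used in the work discussed here):

    import numpy as np

    def normal_equations_from_blocks(A_blocks, y_blocks):
        # Each loop iteration is what one processor computes from its own rows;
        # the running sums play the role of the reduction across processors.
        n = A_blocks[0].shape[1]
        M = np.zeros((n, n))
        c = np.zeros(n)
        for A_i, y_i in zip(A_blocks, y_blocks):
            M += A_i.T @ A_i
            c += A_i.T @ y_i
        return np.linalg.solve(M, c)             # the O(n^3) solution step

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10000, 6))
    y = A @ np.arange(1.0, 7.0) + 0.01 * rng.standard_normal(10000)
    x = normal_equations_from_blocks(np.array_split(A, 8), np.array_split(y, 8))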

The partial normal matrices $M_i = A_i^T A_i$ are determined on each processor and then assembled using a recursive doubling method to give
\[
  M = \sum_{i=1}^{p} M_i.
\]
This approach assumes that the complexity of the model is such that the matrix $M$ can be held in the memory of one processor. We will see examples where further structure of this matrix can be exploited to distribute it onto several processors. This scheme is very effective for very large data sets and large models. The solution step can now be parallelised as well in order to deal with its $O(n^3)$ complexity.

It is often pointed out that the QR factorisation approach is more stable and will give better results than the normal equation approach. It turns out that this approach can also be performed in two steps:
(1) determine the QR factorisation $A = QR$ and $Q^T b$;
(2) solve $Rx = Q^T b$.
Again only the first step is data dependent. It can be parallelised by partitioning the data as in the case above. QR factorisations are then performed on each processor, giving $A_i = Q_i R_i$ and $Q_i^T b_i$, where $b_i$ denotes the corresponding block of $b$. The results of this, i.e., the $R_i$ and $Q_i^T b_i$, can now be exchanged and an overall QR factorisation is obtained, again using a recursive doubling method. As the amount of parallelism used by the recursive doubling is limited, one can in addition utilise some of the internal parallelism of the factorisations by applying the BLAS level 3 approaches discussed above.

So far we have considered the parallel solution of least squares problems. The algorithms considered here have another aspect, as the actual model will be selected from many candidates. In order to compare several candidates one has to solve many different least squares problems. Often the algorithms used take the form of greedy algorithms where a certain model is increased by some extra degrees of freedom and various ways to increase a given model are compared. Thus one solves many problems with residual equations
\[
  \tilde A x - y = r, \qquad \tilde A = \begin{pmatrix} A & B \end{pmatrix}
\]
for a given $A$ and many different $B$, with fixed $y$. The normal matrix for this case is
\[
  \begin{pmatrix} A^T A & A^T B \\ B^T A & B^T B \end{pmatrix}.
\]
The update requires a further scan of all the data to determine $A^T B$ and $B^T B$, which can be done in parallel as before. In addition one can do the updates for several possibilities $B$ in parallel. The Cholesky factorisation can also be updated efficiently, see, e.g., [28]. Note, however, that the granularity of the update is substantially less than the granularity of the full factorisation.

We have so far focused on the case of dense linear algebra. This is an important case and, in fact, many of the matrices occurring in predictive modelling are dense, e.g. when linear combinations of radial basis functions are fitted or for additive models.
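The QR variant of the same partition-and-merge pattern can be sketched in a few lines (the merge of the small triangular factors is done here in a single step rather than by recursive doubling; again an illustration, not the paper's code):

    import numpy as np

    def blockwise_qr_lsq(A_blocks, y_blocks):
        # Factor each block of rows independently, keep only R_i and Q_i^T b_i,
        # then merge the stacked triangular factors with one more QR.
        Rs, cs = [], []
        for A_i, y_i in zip(A_blocks, y_blocks):
            Q_i, R_i = np.linalg.qr(A_i)
            Rs.append(R_i)
            cs.append(Q_i.T @ y_i)
        Q, R = np.linalg.qr(np.vstack(Rs))       # merge step
        return np.linalg.solve(R, Q.T @ np.concatenate(cs))

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 5))
    y = A @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(1000)
    x = blockwise_qr_lsq(np.array_split(A, 4), np.array_split(y, 4))
    print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))   # True up to rounding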

However, for very large scale computations the complexity of dense matrix computations is not feasible and, in fact, we will see how such dense matrices can be avoided. So one has to ask the question whether the principles discussed above are still applicable in the sparse matrix case. It turns out that this is indeed the case.

The application of dense linear algebra to the direct solution of sparse least squares problems is well established in the multifrontal methods [1]. These methods make use of the fact that during the elimination procedure the original sparsity of the matrix tends to gradually disappear as more and more dense blocks appear. This is certainly the case for the simplest sparse matrices, the block bidiagonal matrices, which have the form
\[
  \begin{pmatrix}
    A_1 &     &     &        \\
    B_2 & A_2 &     &        \\
        & B_3 & A_3 &        \\
        &     & \ddots & \ddots
  \end{pmatrix}.
\]
The assembly stage is not fundamentally different from the dense case. In the factorisation the amount of parallelism seems now confined by the inherent dependency of the factorisation, which proceeds in a frontal fashion along the diagonal. Each stage of the factorisation will consider a new block row with $B_{i+1}$ and $A_{i+1}$ and will do a dense factorisation on these blocks. Thus the parallelism in this dense factorisation can be readily exploited. In addition, one can see that by an odd-even permutation (i.e., taking first all the odd numbered block rows and columns and then the even numbered ones) one gets a block system with an extra degree of parallelism. This approach is very similar to the odd-even reduction approach to summation. However, it turns out that the amount of floating point operations in this approach is higher than in the original factorisation. For more details on this, see [43, 3, 40].

In the case of more general matrices one can observe very similar effects. However, instead of having a single front and dense factorisations of a constant size, one now has multiple fronts which subsequently combine to form ever larger dense matrices. In addition to the dense parallelism one also has the parallelism across the various fronts, which is of the same nature as the parallelism for the block bidiagonal case. There is also the potential of trading some redundant computations for additional parallelism. See [1, 20] for more details. These multifrontal methods originated in the solution of finite element problems. For finite element problems one also has two stages, very similar to the two stages (forming of the normal equations and solution) of the least squares problems. It turns out that, as in the case of least squares problems, the assembly can be viewed as an elimination step for the augmented system of equations. The assembly and the factorisation of the matrix can be combined into one stage, as in the case of the QR factorisation.

2. Additive Models

2.1. The Backfitting Algorithm. Additive models are functions of the form
\[
  f(x_1,\dots,x_d) = f_0 + \sum_{j=1}^{d} f_j(x_j).
\]
For this subclass the curse of dimensionality is not present as only $O(d)$ storage space is required.

The constant component $f_0$ is obtained as an average of the values of the response variable:
\[
  f_0 = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}.
\]
For the estimation of the $f_j$, one-dimensional smoothers can be used. A smoothing parameter controls the tradeoff between bias and variance. The choice of the value of the smoothing parameter is essential and has been extensively discussed in the literature; see [35, 32] for an introduction and further pointers. Using a smoother $S_j$, the $f_j$ are defined as the solution of the equations
\[
  (1) \qquad f_j = S_j\Bigl(y - f_0 - \sum_{i \ne j} f_i \Bigm| x_j\Bigr)
\]
where $f_j$ is the vector of values of $f_j$ at the data points $x_j^{(i)}$, $i = 1,\dots,N$, i.e., $f_j = f_j(x_j)$. Uniqueness of the solution is obtained through orthogonality conditions on the $f_j$. This choice of the $f_j$ is motivated by the fact that the functions which minimise the expected squared residuals satisfy
\[
  f_j^e(x_j) = E\Bigl(Y - f_0^e - \sum_{i \ne j} f_i^e(X_i) \Bigm| X_j = x_j\Bigr)
\]
and the equations are obtained by replacing the expectation by the smoother. Alternatively, the fixpoint equations are obtained from a smoothing problem for additive models [34, 60].

The simple form of these equations, where in the $j$th equation the variable $x_j$ only occurs explicitly on the left hand side, suggests the use of a Gauss-Seidel or SOR process [46] for their solution. More specifically, for the Gauss-Seidel case a sequence of component functions $f_j^k$ is produced with
\[
  f_j^0 = S_j(y - f_0 \mid x_j).
\]
The iteration step is then
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k \Bigm| x_j\Bigr)
\]
where $f_j^k = f_j^k(x_j)$. This is the backfitting algorithm; see [34, 33, 35] for more details and also for a discussion of the local scoring algorithm which is used for generalised additive models.

In the case where the smoother $S_j$ is linear one introduces matrices $S_j$ (which depend on $x_j$) such that $f_j = f_j(x_j) = S_j(y \mid x_j)(x_j) = S_j y$. As a consequence of the fixpoint equations (1) one obtains a linear system of equations for the function value vectors $f_j$:
\[
  (2) \qquad
  \begin{pmatrix}
    I   & S_1 & \cdots & S_1 \\
    S_2 & I   & \cdots & S_2 \\
    \vdots &  & \ddots & \vdots \\
    S_d & S_d & \cdots & I
  \end{pmatrix}
  \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_d \end{pmatrix}
  =
  \begin{pmatrix} S_1(y - f_0) \\ S_2(y - f_0) \\ \vdots \\ S_d(y - f_0) \end{pmatrix}.
\]
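The Gauss-Seidel backfitting loop just described can be sketched as follows (NumPy, with a crude running-mean smoother standing in for the one-dimensional smoothers $S_j$; a simplification of mine, not the paper's code):

    import numpy as np

    def backfit(x, y, smoothers, n_iter=20):
        # Gauss-Seidel backfitting for f0 + sum_j f_j(x_j); smoothers[j] maps
        # (x_j, partial residual) to fitted values at the data points.
        N, d = x.shape
        f0 = y.mean()
        f = np.zeros((d, N))                     # f[j] holds f_j at the data points
        for _ in range(n_iter):
            for j in range(d):
                r = y - f0 - f.sum(axis=0) + f[j]
                f[j] = smoothers[j](x[:, j], r)
                f[j] -= f[j].mean()              # centre to keep the decomposition unique
        return f0, f

    def running_mean_smoother(xj, r, width=0.1):
        # purely illustrative local average
        return np.array([r[np.abs(xj - v) <= width].mean() for v in xj])

    rng = np.random.default_rng(1)
    x = rng.uniform(size=(500, 2))
    y = np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
    f0, f = backfit(x, y, [running_mean_smoother, running_mean_smoother])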

The backfitting procedure leads to an iterative characterisation of the $f_i^k$:
\[
  f_j^0 = S_j(y - f_0)
\]
and
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k\Bigr).
\]
This is formally a block Gauss-Seidel procedure for the component vectors and the established convergence theory for these methods can be applied. In particular, convergence is obtained if the matrices $S_j$ have eigenvalues in $(0,1)$. In fact, it turns out that in many practical applications backfitting converges in a couple of iterations. Note, however, that typically the block matrix is singular, so that standard convergence theorems do not apply directly but require a simple modification.

In general, the backfitting algorithm produces a sequence of functions $f_1^1,\dots,f_d^1, f_1^2,\dots,f_d^2, f_1^3,\dots$. Each element of the sequence depends on the $d$ immediately preceding elements. For the determination of one element, one smoother $S_j$ is evaluated. Thus the complexity of the determination of one element in this sequence is $O(N)$. For the full additive function one requires $d$ consecutive elements $f_j$. The complexity of the determination of $f_1^k,\dots,f_d^k$ is thus $O(kdN)$. Due to the dependency of the elements in the sequence one cannot determine multiple elements of the sequence simultaneously. A parallel implementation of the backfitting procedure would thus have to exploit any parallelism which is inherent in the smoother $S_j$ and in the evaluation of $y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k$. While this is possible, we will in the following discuss a different approach which can be viewed as an alternative to the ordinary backfitting.

There are other alternatives to the parallel algorithm discussed here. One simple alternative is based on a Jacobi or fixpoint iteration. For this method the sequence of functions is defined by
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i \ne j} f_i^k \Bigm| x_j\Bigr).
\]
This substantially decreases the convergence rate but allows the simultaneous determination of the $d$ functions $f_j^{k+1}$. One can improve convergence by accelerating this procedure with a conjugate gradient method. There are, however, additional problems with this approach. First, the approach leads to a parallelism of at most degree $d$. But a speedup of $d$ may not be possible due to unbalanced loads, as it is unlikely that the time for all smoothers $S_j$ is the same. This is true in particular if one includes binary and categorical predictor variables. Load balancing would also typically be lost if one includes interaction terms or functions $f_{ij}(x_i, x_j)$, inclusions which do not require major modifications of the backfitting method but which may improve predictive performance. In the case of lower levels of parallelism $p$ and large $d$ one can get some load balancing using a master-worker approach where each processor can demand another component once it has completed earlier work. Such a dynamic load-balancing approach is well suited to a distributed computing environment. However, maybe the main problem with this approach of parallelising over the component functions $f_j$ is that each processor will require reading a large proportion of all the data. Thus the memory available limits the feasibility of this approach. An out-of-core method may be considered, but this can introduce substantial overheads as the data needs to be transferred between secondary storage and memory, since each iteration needs to visit all the data.

Thus for the data intensive iteration steps of backfitting one would prefer to have all the data points in memory. A difficulty of the parallel Jacobi method is also that at each step of the iteration data communication needs to be performed to move the new $f_i$ (or at least their sum) to all the processors. Further parallel iterative methods can be developed by applying known methods from linear algebra [51] to the block linear system of equations (2). An issue one has to deal with is the singularity of the block matrix, as most iterative methods are defined for nonsingular matrices.

2.2. A Parallel Backfitting Algorithm. A parallel backfitting algorithm is required for time-critical applications, when one deals with very large data sets and in the case of slow convergence. Here we discuss an algorithm which is based on data partitioning. The proposed method has applications outside of parallel processing as well, as it can be used for the case where the data is distributed over multiple repositories and for the case where not all the data fits into main memory. Data mining applications commonly do require processing of very large data sets, which may be distributed, will not fit in the memory, and have many variables. Thus data mining predictive modelling problems are applications which can ideally profit from the parallel algorithms discussed here.

The algorithm is based on a partition of the data into $p$ subsets. The elements of subset $r$ are denoted by $x_1^{[r]},\dots,x_d^{[r]}, y^{[r]}$ where $r = 1,\dots,p$. Using the same smoothers for each subset and the backfitting algorithm one obtains $p$ partial models $f_j^{[r]}$ for $j = 0,\dots,d$ and $r = 1,\dots,p$. The final model is obtained as a weighted sum
\[
  f_j = \sum_{r=1}^{p} w_r f_j^{[r]}
\]
for some weights $w_r$. The weights characterise the influence of each partial component on the overall model. They should satisfy $w_r \ge 0$ and $\sum_{r=1}^{p} w_r = 1$. They might include some measure of the reliability of each estimate; for example they could just be $N_r/N$ where $N_r$ is the number of data points in subset $r$. One can also iteratively refine this method, first fitting the data in this way, then the residuals, etc.

If subset $r$ requires $k_r$ iterations to find the partial model, the time for this subset is $O(d k_r N_r)$, say $d k_r N_r \tau_1$. If all the subsets are processed in parallel the total time for this step is then
\[
  T_1 = d \max_r k_r N_r \tau_1.
\]
The determination of the sum is a reduction operation. If we assume that the subsets are distributed one can do this reduction in $\log_2(p)$ steps by first adding the even-numbered elements to the preceding ones to get an array of models of size $p/2$ and then repeating this step until only one model is left.
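Before turning to the cost of the reduction step, here is a compact sketch of the whole partition-fit-combine scheme. It uses a deliberately simple binned additive fitter of my own (predictors assumed to be scaled to [0, 1)) so that the partial models are plain coefficient arrays which can be averaged; it is not the implementation discussed below.

    import numpy as np

    def fit_binned_additive(x, y, n_bins=20, n_iter=10):
        # Tiny additive-model fitter: each f_j is piecewise constant on n_bins
        # cells of [0, 1), fitted by backfitting; returns (f0, coefficient array).
        N, d = x.shape
        f0 = y.mean()
        coef = np.zeros((d, n_bins))
        bins = [np.minimum((x[:, j] * n_bins).astype(int), n_bins - 1) for j in range(d)]
        for _ in range(n_iter):
            for j in range(d):
                r = y - f0 - sum(coef[k][bins[k]] for k in range(d) if k != j)
                for b in range(n_bins):
                    m = bins[j] == b
                    coef[j, b] = r[m].mean() if m.any() else 0.0
        return f0, coef

    def parallel_fit(x, y, p, **kw):
        # Partition the data into p random subsets, fit a partial model on each
        # (independent fits -> one per processor), combine with weights N_r / N.
        N = len(y)
        parts = np.array_split(np.random.permutation(N), p)
        fits = [(len(i) / N, *fit_binned_additive(x[i], y[i], **kw)) for i in parts]
        f0 = sum(w * f0_r for w, f0_r, _ in fits)
        coef = sum(w * c_r for w, _, c_r in fits)
        return f0, coef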

As all these models are of the same size, the time to do this step is
\[
  T_2 = \log_2(p) \sum_{j=0}^{d} m_j \tau_2
\]
for some constant $\tau_2$, where $m_j$ is the complexity of the model $f_j$. Note that in the case of a shared memory one could make use of the higher parallelism by adding the components in parallel, and one would get in this case (if enough processors are available)
\[
  T_2 = \log_2(p) \max_j m_j \tau_2.
\]
The speedup obtained through this algorithm is
\[
  S_p = \frac{d k N \tau_1}{d \max_r k_r N_r \tau_1 + \log_2(p)\sum_{j=0}^{d} m_j \tau_2}.
\]
While the summing up part ($T_2$) is important for very large $p$ and smaller $N$, in the cases of interest in data mining the first component is often dominating. However, data communication costs are included in the constant $\tau_2$, and depending on the ratio of communication to computation time (basically $\tau_2/\tau_1$ in this case) the second component may be substantial. Where this is not the case, and if one partitions the data in a balanced way so that $N_r = N/p$ and introduces the parameter $\kappa_p = k/\max_r k_r$, then one gets a speedup of
\[
  S_p = \kappa_p\, p.
\]
So in order to determine the speedup one now needs to find $\kappa_p$. If $\kappa_p$ is bounded from below for fixed $N/p$ and growing $p$ then the algorithm is scalable. Clearly the value of $\kappa_p$ depends on the actual partitioning of the data. We defer this discussion a bit and first consider another question: how does the resulting additive model compare to the additive model obtained from the backfitting algorithm? Again this will depend on the way the data is partitioned but also on the weights $w_r$. In a very simple case it turns out that the iterates of the method presented above can also be generated by a smoothing procedure very similar to backfitting. This is the case of $p$ times repeated observations (the same $x$ values but arbitrary $y$ values for all partitions) which are distributed over the $p$ processors. We formulate this simple observation as a proposition (the index $j$ is always assumed to be $j = 1,\dots,d$):

Proposition 1. Let $\bar x_j = [x_j^{[1]},\dots,x_j^{[p]}]$ and $\bar y = [y^{[1]},\dots,y^{[p]}]$. Furthermore let $f_0^{[r]}$ and $f_j^{[r]}$ be the components of an additive model which have been fitted to the partial data $x_j^{[r]}$ and $y^{[r]}$ such that equations (1) hold for these data points. Now let $w_r \ge 0$ with $\sum_{r=1}^{p} w_r = 1$ and set $f_0 = \sum_{r=1}^{p} w_r f_0^{[r]}$ and $f_j = \sum_{r=1}^{p} w_r f_j^{[r]}$. If $x_j^{[r]} = x_j^*$ for all $r$, the smoothers for the partial components are linear with matrix $S_j$, and $\bar f_j = f_j(\bar x_j)$, then
\[
  \bar f_j = \bar S_j\Bigl(\bar y - f_0 - \sum_{i \ne j} \bar f_i\Bigr)
\]

with
\[
  \bar S_j = \begin{pmatrix} w_1 S_j & \cdots & w_p S_j \\ \vdots & & \vdots \\ w_1 S_j & \cdots & w_p S_j \end{pmatrix}.
\]

Proof. The vector $\bar f_j$ consists of $p$ identical blocks $f_j(x^*)$. We need to show that
\[
  f_j(x^*) = \sum_{r=1}^{p} w_r S_j\Bigl(y^{[r]} - f_0 - \sum_{i \ne j} f_i(x^*)\Bigr).
\]
By expanding the right hand side using $\sum_{r=1}^{p} w_r = 1$ and the definitions of $f_0$ and $f_j$ one gets
\begin{align*}
  \sum_{r=1}^{p} w_r S_j\Bigl(y^{[r]} - f_0 - \sum_{i\ne j} f_i(x^*)\Bigr)
  &= S_j\Bigl(\sum_{r=1}^{p} w_r y^{[r]} - f_0 - \sum_{i\ne j} f_i(x^*)\Bigr)\\
  &= S_j\Bigl(\sum_{r=1}^{p} w_r y^{[r]} - \sum_{r=1}^{p} w_r f_0^{[r]} - \sum_{i\ne j} \sum_{r=1}^{p} w_r f_i^{[r]}(x^*)\Bigr)\\
  &= S_j\Bigl(\sum_{r=1}^{p} w_r \bigl(y^{[r]} - f_0^{[r]} - \sum_{i\ne j} f_i^{[r]}(x^*)\bigr)\Bigr).
\end{align*}
Now one can use the property of equation (2) for the components to see that this is equal to
\[
  \sum_{r=1}^{p} w_r f_j^{[r]}(x^*) = f_j(x^*).
\]

So the parallel algorithm satisfies the equations (2) and thus the same methods for the analysis of convergence etc. can be applied to this method. The underlying one-dimensional smoother is quite intuitive. Many actual smoothers have the property that the smoother determined as the weighted average of the partial smoothers with $w_r = 1/p$ is exactly the same as the smoother applied to the full data set. In this case the sequential and the parallel methods will give the same results. One can also see that the eigenvalues of the matrices $\bar S_j$ and those of the partial matrices $S_j$ are the same in this case, and thus $k_r = k$ and $\kappa_p = 1$ up to small variations which can depend on the data vectors $y^{[r]}$. (The sequential backfitting turns out to be the backfitting for a problem with matrix $S_j$ and data $\sum_r w_r y^{[r]}$.)

In the more general case, where the data partitions are all of the same size and have been randomly sampled (without replacement) from the original data, the parallel algorithm is an instance of bagging without replacement; see [11, 23, 14] for further analysis. Basically, the parallel algorithm gives results very close to the results of the sequential algorithm. There might even be some advantages in choosing the parallel backfitting algorithm over the sequential one due to higher variance reduction, especially for nonlinear estimators. A parallel implementation of bagging has been suggested in [53], where it is also pointed out that the parallel algorithm of the kind discussed above may have superior statistical properties. We observe that the parallel backfitting procedure can be easily modified to provide a more general parallel bagging procedure.

Besides bagging and parallel additive models, other areas in data mining use the same idea of partitioning the data, applying the underlying method to all the parts and combining the results. An example where this is done in association rule mining can be found in [52].

2.3. Implementation of the Block and Merge Method. An implementation of a variant of the algorithm discussed above has been suggested and discussed in [39]. The method was tested on a SUN E4000 with 8 SPARC processors of 167 MHz each and 4.75 GBytes of shared memory. The application considered was the estimation of risk for a large car insurance database. The implementation used a master-worker approach and was written in C++, using the MPI message passing library for data exchange between processors. While this is not necessary for shared memory processors, the overhead of using MPI is small and it makes the code much more portable. Tests showed that this approach is scalable in the data size and in the number of processors, and achieves good speedups for the small numbers of processors considered. However, for larger numbers of processors and for distributed computing this approach has not been studied.

In this implementation the data is partitioned into m blocks, where m is typically larger than the number of processors available. The partitions are random and all of equal size. Each block contains N/m records, and the block size is chosen such that a block fits into memory, which provides a performance advantage for the backfitting algorithm as no swapping is required. It was found that choosing a small smoothing parameter for the one-dimensional smoothers has an advantage. This reduces the bias of the method at the cost of increased variance. Merging the models from the m partial data sets may involve averaging but may actually also include some smoothing. Additional smoothing may be done particularly after the last merge in order to control the tradeoff between bias and variance. The merging stages, however, were not very costly and consequently were done sequentially on one processor. The smoothing parameter in the initial stages of the algorithm is thus not only used to trade off bias and variance; in addition it controls the number of iterations and resolves numerical difficulties due to uneven data distribution. Only at the final combination step is the smoothing parameter used exclusively to balance bias and variance. For very large data sizes continuous data was discretised (binned) onto a regular grid. This allows effective data compression and leads to banded systems of equations if smoothing splines on equidistant grids are used. Note that binning can substantially reduce the data size and thus increase computational performance, but it may also lead to load imbalance when the data sizes on the different processors are reduced differently.

3. Thin Plate Splines

3.1. Fitting Radial Basis Functions. Radial basis function approximations are real functions $f(x)$ with $x \in \mathbb{R}^d$ of the form
\[
  (3) \qquad f(x) = \sum_{i=1}^{N} c_i\, \rho(\|x - x^{(i)}\|) + p(x),
\]
where $p$ is a polynomial, typically of low degree and even zero in many cases. Radial basis functions have recently attracted a lot of attention as they can efficiently approximate smooth functions of many variables.

Examples for the function $\rho$ include Gaussians $\rho(r) = \exp(-\alpha r^2)$, powers and thin plate splines $\rho(r) = r^\beta$ and $\rho(r) = r^\beta \ln(r)$ (the latter for even integers $\beta$ only), multiquadrics $\rho(r) = (r^2 + c^2)^{\beta/2}$ and others. Radial basis functions have been generalised to metric spaces, where the argument of $\rho$ is the distance $d(x, x^{(i)})$ (and the polynomial $p = 0$). As always, let $x^{(i)}$ be the data points of the predictor variables. The vector of coefficients of the radial basis functions $u = (u_1,\dots,u_N)$ and the vector of coefficients of the polynomial term $v = (v_1,\dots,v_m)$ are determined, in the case of smoothing, by a linear system of equations of the form
\[
  (4) \qquad
  \begin{pmatrix} A + \alpha I & P \\ P^T & 0 \end{pmatrix}
  \begin{pmatrix} u \\ v \end{pmatrix}
  =
  \begin{pmatrix} y \\ 0 \end{pmatrix}.
\]
The block $A$ has elements $\rho(\|x^{(i)} - x^{(j)}\|)$, the elements of the matrix $P$ are $p_j(x^{(i)})$ where the $p_j$ are the basis functions for the polynomials, the matrix $I$ is the identity and $\alpha$ is the smoothing parameter. The vector $y$ contains the observed values $y^{(i)}$ of the response variable. Existence, uniqueness and approximation properties, in particular for the case of interpolation ($\alpha = 0$), have been well studied. Reviews of radial basis function research can be found in [22, 47, 13].

The main computational task in the determination of a fit with radial basis functions is the solution of the system of equations (4). The dimension of the polynomial part is $m = O(d^s)$ for some small $s$ ($m = d+1$ in the case of linear polynomials) and $N$ is very large. Thus most elements of the matrix of the linear system of equations are in the block $A$. While the radial basis functions $\rho$ may be nonzero far from the origin, the effects of the combination of the functions cancel out, a consequence of the equation $P^T c = 0$. In spite of this locality the matrix $A$ is usually mostly dense. In high dimensions this may even be true when the radial basis function $\rho$ has a small support $\{x \mid \rho(x) \ne 0\}$. The reason for this is the concentration effect which is observed in high dimensions. One finds that the distribution of pairwise distances becomes very narrow, so that all randomly chosen pairs of points have close to the same distance. As the support of the radial basis function has to be chosen so that for every $x^{(i)}$ there are at least some $x^{(j)}$ with $\rho(\|x^{(i)} - x^{(j)}\|) \ne 0$, it follows from the concentration and the smoothness of $\rho$ that for any $i$ one gets $\rho(\|x^{(i)} - x^{(j)}\|) \ne 0$ for most other $j$. It follows that $A$ is fundamentally a dense matrix.

When solving the system (4) one can use either direct or iterative solvers. Due to the density of the matrix the direct solver will have complexity $O(N^3)$ and an iterative solver at least complexity $O(N^2)$, as it requires more than one matrix vector product. Efficient parallel algorithms for both cases are available and have been widely discussed in the literature; see, e.g., [9] for software and further references. The $O(N^2)$ or $O(N^3)$ complexity, however, implies non-scalability, and thus very large problems are not feasible for this approach. Similarly to the approach discussed in Section 2 one may use a bagging approach here as well and partition the data, fit a partial model to every part and average over the partial models. While this is a valid approach, we will now consider a different approach.
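For concreteness, here is a dense NumPy sketch of assembling and solving (4) for a two-dimensional thin plate spline (the kernel $r^2\log r$ and a linear polynomial part are assumptions of this sketch, and sign and scaling conventions vary in the literature). The dense solve is exactly the non-scalable $O(N^3)$ step discussed above:

    import numpy as np

    def tps_kernel(X1, X2):
        # thin plate kernel r^2 log r (value 0 at r = 0)
        r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r > 0, r**2 * np.log(r), 0.0)

    def tps_rbf_fit(X, y, alpha=1e-3):
        # Assemble and solve the saddle point system (4); O(N^2) storage, O(N^3) work.
        N = X.shape[0]
        A = tps_kernel(X, X)
        P = np.hstack([np.ones((N, 1)), X])           # polynomial basis 1, x1, x2
        K = np.block([[A + alpha * np.eye(N), P],
                      [P.T, np.zeros((3, 3))]])
        rhs = np.concatenate([y, np.zeros(3)])
        sol = np.linalg.solve(K, rhs)
        return sol[:N], sol[N:]                       # u (rbf coefficients), v (polynomial)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]
    u, v = tps_rbf_fit(X, y, alpha=1e-4)
    yhat = tps_kernel(X, X) @ u + np.hstack([np.ones((200, 1)), X]) @ v   # fitted values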

3.2. Parallel and Scalable Finite Element Approximation of Thin Plate Splines. The radial basis function smoother can also be characterised variationally. In the case of thin plate splines [19] the radial basis functions are Green's functions of a partial differential equation and the radial basis smoother is the smoothing spline which minimises
\[
  (5) \qquad M_\alpha(f) = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x^{(i)}) - y_i\bigr)^2
  + \alpha \int_\Omega \Bigl(\frac{\partial^2 f(x)}{\partial x_1^2}\Bigr)^2
  + 2\Bigl(\frac{\partial^2 f(x)}{\partial x_1 \partial x_2}\Bigr)^2
  + \Bigl(\frac{\partial^2 f(x)}{\partial x_2^2}\Bigr)^2 \, dx
\]
in the case of $d = 2$. The intuition behind this is that the parameter $\alpha$ allows one to control the stiffness of the spline and thus the curvature of the fit. A small $\alpha$ will lead to a closer fit of the data, while a large $\alpha$ will lead to a smoother fit. The domain $\Omega$ is either $\mathbb{R}^2$ or a compact subset; for our computations we will consider the case of a compact domain $\Omega$ rather than $\Omega = \mathbb{R}^2$.

Instead of using the radial basis functions (for the case $\Omega = \mathbb{R}^2$) and equation (4) with linear polynomials, one can develop a good approximation to this problem using numerical methods for the solution of elliptic partial differential equations, in particular finite element methods. This approach transforms the smoothing problem into a penalised finite element least squares fit. Finite element approximations have well established approximation properties, e.g., [17, 10]. The main results indicate that generic finite element approximations lead to good approximations for smooth enough functions and regular domains. We will only discuss the computational aspects further here.

Fundamentally, the finite element method approximates the function $f$ with a function from a finite dimensional linear space, in particular a space of piecewise polynomial functions. This approximation is written as a linear combination of basis functions which typically have a compact (small) support (the set of $x$ where the basis function is nonzero). Thus we get a representation
\[
  f(x) = \sum_{j=1}^{m} u_j B_j(x)
\]
with basis functions $B_j$ and coefficients $u_j$. The number of degrees of freedom of the approximating space is denoted by $m$ and is typically much smaller than $N$. We are now looking for a function $f$ of this form which minimises the functional $M_\alpha$ in (5). The vector of coefficients $u = [u_1,\dots,u_m]$ then minimises the quadratic function
\[
  u^T M u - 2 b^T u + \frac{1}{N} y^T y + \alpha u^T S u
\]
for some matrices $M$ and $S$ and a vector $b$, the determination of which will be discussed later. The minimum of this function satisfies the equations
\[
  (M + \alpha S) u = b.
\]
Thus the fitting problem has now been reduced to two problems, the first one being the determination of $M$, $S$ and $b$ and the second one the solution of this system. We will now discuss both steps and see how parallel processing can substantially improve performance but, maybe most importantly, how the scalability problem of the original radial basis function approach has suddenly disappeared! These two steps can be found in any finite element solver. The first step is called the assembly step and the second the solution step.
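The two steps can be seen in miniature in the following one-dimensional analogue (a simplification of mine, not the method developed here: hat functions on a uniform grid over [0, 1] and a simple second-difference penalty standing in for the stiffness matrix $S$). The assembly scans the data once, and the solve involves only the small $m \times m$ system:

    import numpy as np

    def penalised_fe_fit_1d(x, y, m, alpha):
        # Assemble M and b in one pass over the data, then solve (M + alpha S) u = b.
        N = len(y)
        h = 1.0 / (m - 1)
        M = np.zeros((m, m))
        b = np.zeros(m)
        for xi, yi in zip(x, y):                  # assembly: O(N), one data scan
            k = min(int(xi / h), m - 2)           # grid interval containing xi
            t = xi / h - k
            B = np.array([1 - t, t])              # the two nonzero hat functions at xi
            M[k:k + 2, k:k + 2] += np.outer(B, B) / N
            b[k:k + 2] += yi * B / N
        D = np.diff(np.eye(m), n=2, axis=0)       # second differences as a crude penalty
        S = D.T @ D / h**3
        return np.linalg.solve(M + alpha * S, b)  # m x m system, independent of N

    rng = np.random.default_rng(0)
    x = rng.uniform(size=2000)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(2000)
    u = penalised_fe_fit_1d(x, y, m=50, alpha=1e-6)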

From the variational problem one sees that a general element of the mass matrix $M$ is
\[
[M]_{jk} = \frac{1}{N} \sum_{i=1}^{N} B_j(x^{(i)})\, B_k(x^{(i)}),
\]
the right-hand side $b$ has the components
\[
b_k = \frac{1}{N} \sum_{i=1}^{N} y^{(i)} B_k(x^{(i)})
\]
and the elements of the stiffness matrix $S$ are the integrals
\[
[S]_{kj} = \int_\Omega \left[ \frac{\partial^2 B_k(x)}{\partial x_1^2}\frac{\partial^2 B_j(x)}{\partial x_1^2}
+ 2\,\frac{\partial^2 B_k(x)}{\partial x_1 \partial x_2}\frac{\partial^2 B_j(x)}{\partial x_1 \partial x_2}
+ \frac{\partial^2 B_k(x)}{\partial x_2^2}\frac{\partial^2 B_j(x)}{\partial x_2^2} \right] dx.
\]

The complexity of the assembly step depends very much on the support of the basis functions $B_j$. We will assume here that the support is such that for every point $x$ at most $K$ basis functions have $B_j(x) \neq 0$. Thus the complexity of the determination of $b$ is $O(KN)$ and one can see that the complexity of the determination of $M$ is $O(K^2 N)$, as for every point $x$ there are at most $K^2$ products $B_j(x) B_k(x) \neq 0$. The complexity of the determination of $S$ is of the order $O(K^2 m)$ in nondegenerate cases. For large data sets the time to generate the matrix will dominate. Note that this step is scalable in the data size. Moreover, it only requires one scan of the full data set. By partitioning the data into $p$ equal parts this step can also be ideally parallelised up to a reduction step. The complexity for this stage is thus approximately
\[
T_a = \frac{K^2 N \tau_1}{p} + \log_2(p)\,\nu(M)\,\tau_2
\]
where $\nu(M)$ is the number of nonzero elements of $M$. Typically the first part is dominating and a speedup of close to $p$ will be obtained.

The complexity of the solution step depends on the method used. Both direct methods using, e.g., parallel multifrontal algorithms [1] and iterative methods using, e.g., parallel multigrid algorithms [58] are applicable. The speedups obtained for the solution step are not quite as good as for the assembly. This mostly has to do with increased communication requirements, but also with the fact that (in particular in the case of direct solvers) parallel algorithms may require extra operations compared with the corresponding sequential algorithm. This extra work required for parallel execution is sometimes called redundant work.

A Parallel Nonconforming Finite Element Method.

The finite element method delivered a parallel algorithm which is scalable in the data size and thus capable of handling very large fitting problems. However, in our discussion we have glossed over one important point, namely the effect of the second derivatives in the penalty term. They lead to fourth order partial derivatives in the Euler equations and can make the linear systems of equations ill-conditioned and hard to solve, which, e.g., leads to slow convergence of the iterative solvers. But maybe computationally even more of a problem is the fact that this requires the basis functions to be much smoother, which in turn leads to a large $K$ and matrices $M$ and $S$ which are more densely populated. The higher smoothness also requires higher order polynomials which are more expensive to evaluate. Of course many of these problems are not unique to the fitting problem; in fact, they are known to occur for general finite element problems for plates. This contrasts very much with the situation where the penalty is the norm of the gradient of $f$, in which case the Euler equations involve the Laplacian and one can use much simpler, low order elements.
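Returning to the data-parallel assembly step described above, the partition-and-reduce structure can be sketched as follows. The sketch is not from the paper: it again uses the one-dimensional hat functions as a stand-in for the actual elements, and the names such as `assemble_chunk` are hypothetical. Each of the $p$ data parts is processed independently and the partial matrices and vectors are summed in a reduction, matching the cost model $T_a$ above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

GRID = np.linspace(0.0, 1.0, 51)      # regular 1-d grid carrying the hat functions

def assemble_chunk(chunk):
    """Local contributions to M and b from one data partition.
    Every call is independent of the others, so the p calls can be
    executed on p processors and combined afterwards."""
    Xc, yc = chunk
    m = len(GRID)
    h = GRID[1] - GRID[0]
    M_part = np.zeros((m, m))
    b_part = np.zeros(m)
    for x, yi in zip(Xc, yc):
        cell = min(int((x - GRID[0]) / h), m - 2)       # grid interval containing x
        t = (x - GRID[cell]) / h
        idx, vals = (cell, cell + 1), (1.0 - t, t)      # the K = 2 nonzero hats at x
        for j, Bj in zip(idx, vals):
            b_part[j] += yi * Bj
            for k, Bk in zip(idx, vals):
                M_part[j, k] += Bj * Bk
    return M_part, b_part

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    N, p = 100_000, 4
    X = rng.random(N)
    y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(N)

    # Partition the data into p equal parts and assemble each part independently.
    chunks = list(zip(np.array_split(X, p), np.array_split(y, p)))
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(assemble_chunk, chunks))

    # Reduction step: sum the p partial results; a tree reduction needs about
    # log2(p) communication rounds of size nu(M).
    M = sum(M_part for M_part, _ in partials) / N
    b = sum(b_part for _, b_part in partials) / N
```

The subsequent solve of $(M + \alpha S)u = b$ then acts on an $m \times m$ system whose size is independent of $N$, so the overall method remains scalable in the data size.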
