PARALLEL ALGORITHMS FOR PREDICTIVE MODELLING

MARKUS HEGLAND

Abstract. Parallel computing enables the analysis of very large data sets using large collections of flexible models with many variables. The computational methods are based on ideas from computational linear algebra and can draw on the extensive research on parallel algorithms in this area. Many algorithms for the direct and iterative solution of penalised least squares problems and for updating can be applied. Methods for both dense and sparse problems are applicable. An important property of the algorithms is their scalability, i.e., their ability to solve larger problems in the same time using hardware which grows linearly with the problem size. While in most cases large-granularity parallelism is to be preferred, it turns out that even smaller-granularity parallelism can be exploited effectively in the problems considered. The development is illustrated by four examples of nonparametric regression techniques. In the first example, additive models are considered. While the backfitting method contains dependencies which inhibit parallel execution, it turns out that parallelisation over the data leads to a viable method, akin to the bagging algorithm without replacement, which is known to have superior statistical properties in many cases. The second example considers radial basis function fitting with thin plate splines. Here the direct approach turns out to be non-scalable, but an approximation with finite elements is shown to be scalable and to parallelise well. One of the most popular algorithms in data mining is MARS (Multivariate Adaptive Regression Splines), which is discussed in the third example. MARS has been modified to use a multiscale approach, and a parallel algorithm with a small granularity has been seen to give good results. The final example considers the current research area of sparse grids. Sparse grids take up many ideas from the previous examples and, in fact, can be considered as a generalisation of MARS and additive models. They are naturally parallel when the combination technique is used. We discuss limitations and improvements of the combination technique.

1. Introduction

1.1. Data, Functions and Computers. The availability of increasing computational power has led to new opportunities in data processing. This includes the computational ability to handle larger and growing data sets and to fit increasingly complex models. In this review we aim to illustrate this development by discussing several recently developed methods and algorithms. These developments have resulted from joint and independent efforts of several communities. These include statisticians with an interest in computational performance, numerical linear algebraists interested in data processing, the more recently emerging data mining community with interests in algorithms, the computational mathematics community interested in applications of approximation theory, the diverse parallel computing community, machine learners with a computational orientation interested in very large scale problems, and many computational scientists, especially the ones interested in the computational solution of partial differential equations.

Many other communities which I have not mentioned here have contributed to this exciting development as well. I hope the following discussion will be of interest to members of many of these communities. The computational techniques discussed in the following have originated at various places, and the parallel algorithms covered have all been the topic of extensive research conducted at the Australian National University.

For anyone interested in trying out these methods there are many data sets available which allow investigating the performance of computational techniques [7, 21, 44]. While some of these data sets are not large by today's standards they are still useful for scalability studies. Software is available for many of the methods discussed, and while well established companies have constantly been able to improve their standard software, maybe some of the most exciting developments in the area of algorithms and large scale high performance computing software have been happening in the open source community, driven by publicly funded university research.

This review does not cover statistical issues. Even though statistical issues are extremely important, they are covered in many excellent statistical texts, including the recently published [35]. Here we will focus on computational aspects and the algorithms, their computational capability to handle very large data sets and complex models, and the way in which they harness the performance of multiprocessor systems.

Predictive modelling aims at finding functional patterns in data. More specifically, given two sequences $x^{(1)},\dots,x^{(N)}$ and $y^{(1)},\dots,y^{(N)}$ defining the data, predictive modelling aims to find functions $f$ such that $y \approx f(x)$, i.e., $y^{(i)} \approx f(x^{(i)})$. While the type of the predictor $x$ can be fairly arbitrary, including real and binary vectors, text, images, and time series, we will focus in the following on the case where $x$ is an array containing categorical (discrete) and real (continuous) components. The response $y$ will be considered to be real, i.e., we will be discussing algorithms for regression. Many of the observations and algorithms can be generalised to other data types for both the predictors and the response. In particular, the algorithms discussed here are also used for classification, see, e.g., [35].

We will be concerned here with the time it takes to solve a particular regression problem. The time depends firstly on the data size, modelled by $N$ above. The data available in data analysis and, in fact, the data which is actually used in individual studies has been growing exponentially. It has been suggested that this growth is akin to the growth in computational performance (according to Moore's law). GByte sized data sets are very common, TByte data sets are used in large studies and PByte data sets are emerging. Software and algorithm development, however, is done at a slower pace. This discrepancy between the speed of software development and the simultaneous growth of data sizes and computational power is not a problem if the software is able to adapt to new data sizes and hardware. This adaptability is referred to as scalability. Fundamentally, scalability of algorithms and software means that the same algorithm can be used to deal with larger data sets if the computational power is increased accordingly.

There are two fundamentally different ways to increase computational throughput. One way is to increase the speed (clock cycle, memory and communication bandwidth, latencies) of the hardware and the other way is to use multiple (often identical) hardware components which work simultaneously. In order to make full use of this hardware parallelism one needs special algorithms, and this is the focus of the following discussion. Ideally, the throughput of any computation grows linearly with the amount of hardware available.

The amount of disk space required grows linearly with the data size and the work required to read the data grows linearly as well. Thus the computational resources needed to analyse a data set have to grow at least linearly in the data size $N$. Consequently $O(N)$ is the optimal resource complexity and we will in the following assume that all the hardware resources available grow linearly with the data size. The $O(N)$ hardware complexity has important implications for the feasible computational complexity of the algorithms. We can assume that the computations required are close to the limits of what is feasible. One would typically not allow for an increase in computational time even when the data sets grow. However, if we allow the number of processing units to grow linearly in the size of the data then, in the ideal case, the time to process the data stays constant if the algorithms used have $O(N)$ complexity. Again this computational complexity is optimal as one at least needs to scan all the data once. This linear complexity of the algorithm is called scalability with respect to the data size.

Besides the data size another determining factor for performance is model complexity. This complexity relates to the number of components of the predictor variable vector $x$ and the dimension of the function space. While earlier parametric linear models had just a handful of variables and parameters to determine, modern nonparametric models and, in particular, data mining applications consider thousands up to millions of parameters. Finally, the function is usually drawn from a large collection of possible candidate functions. While earlier one was interested in just the best of all models, in data mining applications one would like to find all models which are interesting and supported by the data. Thus the size of the candidate set is a third determining factor of performance.

The algorithms considered here are characterised by the function class. A simple way to generate spaces of functions with many variables is to start with spaces of functions of one variable and consider the tensor product of these spaces. In terms of the basis functions, for example, if $V_i$ are spaces of functions of the form $f(x_i) = \sum_j u_{i,j} b_{i,j}(x_i)$ then the tensor product $\bigotimes_{i=1}^d V_i$ consists of functions of the form
\[
  f(x) = \sum_{j_1,\dots,j_d} u_{j_1,\dots,j_d}\, b_{1,j_1}(x_1)\cdots b_{d,j_d}(x_d).
\]
If (for simplicity) we assume that each component space $V_i$ has $m$ basis functions $b_{i,j}(x_i)$ then the tensor product space has $m^d$ basis functions. As the basis is local we expect that this expansion will contain essentially all basis functions for most functions of interest. For example, if tensor product hat functions are used as a basis, then one has for the constant function $f(x) = 1$ the expansion
\[
  1 = \sum_{j_1,\dots,j_d} b_{1,j_1}(x_1)\cdots b_{d,j_d}(x_d).
\]

This problem has been called the curse of dimensionality and it occurs in many other situations as well. In the following we will consider function classes which, while often subsets of the tensor product space, use a different basis and are able to deal effectively with the curse of dimensionality.

In order to find interesting functions one needs to measure success, using a loss function, error criterion or cost function. This cost function also has a tremendous influence on the development of the algorithm and on the performance. For regression problems the simplest loss function used is the sum of squared residuals
\[
  \sum_{i=1}^{N} \bigl(f(x^{(i)}) - y^{(i)}\bigr)^2.
\]
This is the cost function we will mainly use here. Many other measures are used, including the penalised sum of squares and the median absolute deviation. As always, one should be very careful when interpreting this function, in particular if the function has been determined by minimising this sum. The sum of squares, or so-called resubstitution error, is in this case a very poor measure for the actual error of the fitted function as it will seriously underestimate the actual error. More reliable measures have been developed, including hold-out data sets, cross-validation and the bootstrap. The data is often thought of as drawn from an underlying theoretical data distribution. If we denote the expectation with respect to this distribution by $E(\cdot)$, the expected mean squared residual is
\[
  E\bigl((f(X) - Y)^2\bigr)
\]
where $X$ and $Y$ are the random variables observed in the data. The best function $f$ would be the one which minimises this expectation. This theoretical best function is given by a conditional expectation:
\[
  f(x) = E(Y \mid X = x).
\]
The algorithms we discuss here all aim to approximate this best function.

Parallelism is a major contributor to computational performance. Parallelism is exploited by splitting the instruction and data streams into several independent branches. Hardware supporting this parallelism ranges from different stages of arithmetic pipelines, multiple memory paths and different CPUs with a shared memory to multiple independent processors each with their own memory, collections of workstations and grids of supercomputers. In general, any computation will involve parallelism at many different levels. Slightly simplified, one has $p$ streams of operations which run in parallel instead of having to run sequentially. A fundamental characteristic of parallel execution is the parallel speedup, defined as the ratio of the time for sequential execution to that for parallel execution,
\[
  S_p = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}.
\]
Ideally, if one has $p$ streams running in parallel the speedup is $p$. Thus one introduces the parallel efficiency as the ratio of the actual speedup to the number of streams or parallel processes:
\[
  E_p = \frac{S_p}{p}.
\]
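As a concrete illustration of these two definitions, the following minimal Python helper evaluates them for made-up timings (the numbers are illustrative only, not measurements from any of the experiments discussed here):

    def speedup(t_sequential, t_parallel):
        # S_p = T_sequential / T_parallel
        return t_sequential / t_parallel

    def efficiency(t_sequential, t_parallel, p):
        # E_p = S_p / p, the fraction of ideal linear speedup that is attained
        return speedup(t_sequential, t_parallel) / p

    # Example: a job taking 120 s sequentially and 20 s on 8 processors
    print(speedup(120.0, 20.0))        # 6.0
    print(efficiency(120.0, 20.0, 8))  # 0.75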

Of course one cannot do everything in parallel; some tasks require that other ones are completed first. This dependency of different tasks is often present in iterative methods which determine the next step based on previous ones. In simple cases this dependency can be overcome. For example the determination of a sum $\sum_{i=1}^{N} x^{(i)}$ can either be done with a recursion
\[
  s_0 = 0, \qquad s_{i+1} = s_i + x^{(i)}
\]
or by first determining the sums of pairs of consecutive data points to get a new sequence and then repeating this until finished:
\[
  z_{i-1,0} = x^{(i)}, \qquad z_{i,k+1} = z_{2i,k} + z_{2i+1,k}.
\]
This recursive doubling algorithm increases the amount of parallelism considerably, especially for large data sizes $N$. The number of operations has not increased, but the amount of parallelism is still limited. Rather than completing in a time of approximately $(N-1)/p\,\tau$ this method would require
\[
  T_{\text{parallel}} = \Bigl(\frac{N-p}{p} + \log_2(p)\Bigr)\tau
\]
where $\tau$ is the time for one addition. Thus the efficiency in this case is
\[
  E_p = \frac{1}{1 + \frac{p\log_2(p) - p + 1}{N-1}}.
\]
This situation is not uncommon in parallel processing. Another simple case is reflected in Amdahl's law, where only a proportion of the total computation can be done in parallel. If, say, $r$ is the ratio of work which can only be done sequentially then one gets at best an efficiency of
\[
  E_p = \frac{1}{1 + (p-1)r}.
\]
The speedup is limited to $1/r$ and so the efficiency drops to 0 for very large $p$. As every computation will have some proportion which cannot be shared among other processors, this is a fundamental limitation to the amount of speedup available and thus to the computational time. This may be overcome by considering larger problems which have a smaller proportion $r$ of sequential tasks.

A second example of a dependency which can be overcome is the solution of a linear recursion
\[
  z_0 = a, \qquad z_{i+1} = b_i z_i + c_i.
\]
This can be done by interpreting the recursion as a linear system of equations which can then be solved with a parallel algorithm (we will talk about this more later). However, this comes at a cost and it turns out that in this case the time required is roughly doubled. These extra operations which are required to make the algorithm parallel are called parallel redundancy. This situation is again very common, and an algorithm where the amount of work is increased by a factor of $r$ will at most be able to achieve a parallel efficiency of
\[
  E_p \le \frac{1}{r}.
\]
Note that while the efficiency is bounded, at least it does not go to zero as in the previous case. If enough hardware parallelism $p$ is available this may still be a very interesting option as there is no bound on the smallest amount of time required.
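To make the recursive doubling idea concrete, here is a small NumPy sketch (my own illustration, not code from the paper). It performs the pairwise summation rounds one after the other, so it only models the parallelism — each round consists of independent additions that could run simultaneously — and the second helper simply evaluates the Amdahl efficiency formula above.

    import numpy as np

    def recursive_doubling_sum(x):
        # Repeatedly add pairs of neighbours: z_{i,k+1} = z_{2i,k} + z_{2i+1,k}.
        # The same N - 1 additions as a plain loop, but arranged in about log2(N)
        # rounds of independent pairwise sums.
        z = np.asarray(x, dtype=float)
        while z.size > 1:
            if z.size % 2:                 # pad odd-length arrays with a zero
                z = np.append(z, 0.0)
            z = z[0::2] + z[1::2]          # one (potentially parallel) round
        return z[0]

    def amdahl_efficiency(r, p):
        # E_p = 1 / (1 + (p - 1) r) for a sequential fraction r
        return 1.0 / (1.0 + (p - 1) * r)

    print(recursive_doubling_sum(range(1, 101)))   # 5050.0
    print(amdahl_efficiency(0.05, 16))             # efficiency with 5% sequential work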

For both cases of overcoming dependencies, the order of the operations and even the nature of the operations were changed. While this would typically give the same results if exact arithmetic were used, mainly due to the associativity of addition, this is not the case for floating point arithmetic. A careful analysis of the parallel algorithms is required to guarantee that the results of the computations are still valid. This is true in particular for the second example, where the parallel algorithm can even break down! There are, however, often stable parallel versions of the algorithms available; see [43] for some examples which generalise the second example.

While in principle an efficiency higher than one is not possible, there are cases where the parallel execution can run more than $p$ times faster on $p$ processors than on one processor. This happens for example when the data does not fit in the memory of one processor but does fit in the memory of $p$ processors. In this case one requires the loading of the data from disk during the execution on one processor, say, in $p$ chunks. If one counts the time which it takes to load the data from disk into memory in the parallel execution as well, one can confirm that in this case $E_p \le 1$ as well. However, this example points to an important aspect of parallel processing: the increase in available storage space. Not only are the computational processing capabilities increased in a parallel processor but also the memory, caches and registers. This increase in storage can have a dramatic effect on processing times.

Maybe the main inhibitor of parallel efficiency is problem size. Obviously one needs at least $p$ tasks to get a speedup of $p$. However, smaller problem sizes do not require the application of parallel processing. Often one would like to consider various problems with different problem sizes and one is interested in maintaining parallel efficiency over all problems. The question is thus whether we can maintain parallel efficiency for larger numbers of processors if we at the same time increase the problem size. One chooses the problem size as a function of the amount of parallelism $p$. This is called parallel scalability. In the case of the determination of the sum of $N$ numbers one gets scalability if $N(p) = O(p \log_2(p))$, and many other cases are scalable. Note that the case of Amdahl's law does give a hard limit to scalability, so in this case no problem size will help to maintain efficiency, as the speedup is limited to $1/r$ if $r$ is the amount of sequential processing required.

When mapping a collection of tasks onto multiple resources a speedup of $p$ can only be obtained if all the tasks take the same time, as the time for the parallel execution is equal to the time required for the longest task. This is the problem of load-balancing the work. Thus one has to make sure that the sizes of the tasks are all about the same. One way to achieve this is to have very many tasks and none extremely large. The tasks can either be preallocated based on some estimate or, in many cases more efficiently, be allocated in real time using a master-worker model.

A typical parallel computing task will consist of periods where several independent subtasks are processed in parallel and other periods where information is exchanged between the processors. The average volume of work which is done between the communication periods determines the granularity of the task. In the case of small granularity the communication overhead or latency will strongly influence the time, and so one would aim to do larger tasks between data interchange steps.

In addition to the communication overhead one finds that some parts of the communication are independent of the granularity; thus again larger granularity shows better overall performance due to a reduction in communication volume. Finer granularity is used at the level of pipelines, vector instructions and SIMD (single instruction multiple data) processing, whereas larger granularity is usually preferred for collections of workstations and MIMD (multiple instruction multiple data) processing.

The effect of the hardware on the time to execute an algorithm is mainly determined by the access times of various storage devices, including registers, several levels of cache, memory, secondary storage, distributed storage in a cluster and distributed storage across the Internet. These memory hierarchies are defined by access speeds. The fastest computations are the ones which rely for their data movements almost entirely on the faster memories and only occasionally have to access the slower memories. The reason for the existence of these hierarchies is the enormous price difference between the various types of memories. The amount of slower memory available on any computing system will dwarf the amount of fast memory available. Any algorithm which requires frequent disk access will be much slower than a similar algorithm which mostly accesses registers.

An example of slow data access is the data transfer between different processors without shared memory. Here the programmer has to deal directly with the data movements. The transfer of data between different processors is called communication, and the most popular way to implement it is message passing. While message passing routines do include various complex patterns of data interchange, they all rely on the two primitives send, where one processor makes data available to another processor, and receive, where a processor accepts data from another processor. Thus any data transfer requires two procedure calls on two different processors. This provides a simple means of synchronisation. Software using message passing is even popular on shared memory processors due to its portability and the simple synchronisation mechanism. (Usually there is only a small time penalty through the application of message passing.) The time for message passing is modelled as proportional to the data size plus some latency:
\[
  t_{\text{communication}} = \tau + \beta n
\]
where $n$ is the amount of data transferred (typically in MByte), $\beta$ is the transfer time per unit of data (the reciprocal of the communication bandwidth) and $\tau$ the latency. As mentioned earlier, it is important to communicate data in large batches to avoid the penalty incurred through the latency. Another attractive feature of message passing is that standard message passing libraries are available, in particular MPI [31]; see the pypar package for a binding in Python.

Message passing is used for tasks which are distributed among processors in a multiprocessor system or a collection of workstations. In the case of a grid containing several multiprocessor systems even larger granularity is required. On this level data communication would be done using network data transfer commands (like scp and rsync) and the tasks should have very large granularity. Tools assisting with the access of computational grids can be found in the Globus toolkit. Running the jobs in batch mode on a collection of multiprocessors is appropriate for this large granularity, and open source implementations using an R language frontend are available.
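The send/receive primitives can be illustrated with the mpi4py binding (an assumption of this sketch; pypar offers analogous calls). This is a minimal example, not code from the paper:

    # Run with, e.g.: mpiexec -n 2 python sendrecv.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"block": 7, "rows": 100000}      # illustrative payload
        comm.send(data, dest=1, tag=11)          # "send": make data available to rank 1
    elif rank == 1:
        data = comm.recv(source=0, tag=11)       # "receive": accept data from rank 0
        print("rank 1 received", data)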

1.2. Concepts from Numerical Linear Algebra. We consider the problem of fitting data sets using functions from linear function spaces by (penalised) least squares methods. This problem has two computational components, the search for a good approximating subspace and the solution of a (penalised) least squares problem. We will review some of the developments and ideas from parallel numerical linear algebra which are relevant to the following discussion.

Numerical linear algebra is one of the main areas of parallel algorithm development. A culmination of this work was the release of the LAPACK and ScaLAPACK software [2, 9]. The solution techniques can be grouped into two major classes: iterative methods and direct methods. These methods are applied either to the residual equations $Ax - b = r$ or to the corresponding normal equations $A^T A x = A^T b$.

For the normal equations, iterative methods require the determination of repeated matrix vector products $y = Mx$ with the same matrix $M = A^T A$ but with varying $x$. In the case of dense $n \times n$ matrices $M$ this product requires $O(n^2)$ arithmetic operations. All the $n^2$ real multiplications required for this could be done in parallel in one step, but the additions require at least $\log_2(n) - 1$ steps of recursive doubling. Thus the matrix vector product could be done in $\log_2(n)$ steps if we neglect communication and have $n^2$ processors. However, in the cases considered here $n$ is large, so that it is unrealistic to assume the availability of $O(n^2)$ processors. For realistic parallel algorithms, the matrix is partitioned into blocks,
\[
  M = \begin{pmatrix} M_{1,1} & \cdots & M_{1,t} \\ \vdots & & \vdots \\ M_{s,1} & \cdots & M_{s,t} \end{pmatrix},
\]
and with a corresponding blocking of the vector $x$ the matrix vector product has the form
\[
  y_k = \sum_{j=1}^{t} M_{k,j} x_j \qquad \text{for } k = 1,\dots,s.
\]
So now one uses $st$ processors for the parallel multiplication step and successively fewer for the collective $\log_2(t)$ summation steps. Special variants which have been considered are the cases $t = s$, $t = 1$ and $s = 1$. The variants have slightly different characteristics. For example, in the case of $t = 1$ no reduction step and no loss of parallelism occurs. However, the whole vector $x$ needs to be broadcast at every step. There is no need for a broadcast in the case of $s = 1$, but the reduction operation is more dominant in this case. An analysis shows that the communication costs are similar for all cases. The choice of $st$ (which can be larger than $p$) influences the performance as well. See [45, 59] for a further discussion of these issues. The cases of $s = 1$ and $t = 1$ are especially of interest for sparse matrices as they provide better load balancing, due to the fact that these cases do not require an even distribution of the nonzero elements over the whole matrix but only over the rows or columns.
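A compact NumPy sketch of the blocked matrix vector product follows (sequential here; each of the $s\cdot t$ block products is the piece of work that would be given to one processor, and the inner sum is the reduction):

    import numpy as np

    def blocked_matvec(M, x, s, t):
        # y = M x via an s-by-t block partition of M.
        row_blocks = np.array_split(np.arange(M.shape[0]), s)
        col_blocks = np.array_split(np.arange(M.shape[1]), t)
        y = np.zeros(M.shape[0])
        for rows in row_blocks:
            partials = [M[np.ix_(rows, cols)] @ x[cols] for cols in col_blocks]
            y[rows] = np.sum(partials, axis=0)   # done in log2(t) steps with recursive doubling
        return y

    M = np.random.rand(6, 6)
    x = np.random.rand(6)
    assert np.allclose(blocked_matvec(M, x, s=2, t=3), M @ x)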

The effect of granularity has been considered in depth in the research on direct solvers leading to the ScaLAPACK and LAPACK libraries. The underlying algorithms, including Gaussian elimination, Cholesky factorisation and QR factorisation, can be implemented as a sequence of SAXPY operations which combine two vectors (e.g., rows of a matrix):
\[
  z = \alpha x + y,
\]
where $x$ and $y$ are vectors and $\alpha \in \mathbb{R}$. Here all the products and additions can be done in parallel and the average amount of parallelism is $O(n)$. This fine grain parallelism, even in the case of vector processing, suffers from the fact that there are 3 memory accesses for every 2 floating point operations. However, these operations have been fundamental in the further development; they form part of the level 1 BLAS subroutines.

Now one can group $n$ subsequent updates into one matrix update. This is the basis for matrix factorisations, and the update step is here a rank one modification of a matrix:
\[
  A = A + \alpha x y^T.
\]
Here the total amount of parallelism is the same as in the case of a matrix vector product, and it turns out that the performance and granularity are basically the same. This is the case of a level 2 BLAS routine and it allows efficient usage of vector registers. This is considered to be of intermediate granularity in numerical linear algebra. Note, however, that the amount of data to be accessed is proportional to the number of floating point operations performed ($O(n^2)$).

The highest granularity is obtained when matrix-matrix products are used. In the case of the solution of linear systems of equations this means that operations of the type
\[
  A = A + B C^{-1} D
\]
occur. Such operations allow the effective usage of various caches and efficient parallel execution. Here one sees the volume-to-surface effect of parallel linear algebra: the number of operations performed is $O(n^3)$ while the amount of data accessed is $O(n^2)$. This is one aspect of the large granularity of the computation. These computational building blocks have been optimised for many computational platforms and architectures, and implementations which adapt automatically to different architectures are available [61]. There are various approaches to the implementation on parallel architectures, see, e.g., [9, 56].

Granularity issues are important for iterative methods, and for direct methods using either the normal equations or the QR factorisation of the design matrix $A$. The approach using the normal equations has two computational steps:
(1) determine the normal matrix $M = A^T A$ and the vector $A^T b$;
(2) solve the normal equations $A^T A x = A^T b$.
In our case these two steps have totally different characteristics. In the first step the data is accessed. This step is scalable with respect to the data size and parallelises well if the data is distributed equally as
\[
  A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_p \end{pmatrix}.
\]
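The first, data-dependent step — forming the normal matrix from row-wise distributed data — can be sketched as follows (a sequential NumPy loop standing in for the $p$ processors; an illustration of mine, not the implementation used in the work discussed here):

    import numpy as np

    def normal_equations_from_blocks(A_blocks, y_blocks):
        # Each loop iteration is what one processor computes from its own rows;
        # the running sums play the role of the reduction across processors.
        n = A_blocks[0].shape[1]
        M = np.zeros((n, n))
        c = np.zeros(n)
        for A_i, y_i in zip(A_blocks, y_blocks):
            M += A_i.T @ A_i
            c += A_i.T @ y_i
        return np.linalg.solve(M, c)             # the O(n^3) solution step

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10000, 6))
    y = A @ np.arange(1.0, 7.0) + 0.01 * rng.standard_normal(10000)
    x = normal_equations_from_blocks(np.array_split(A, 8), np.array_split(y, 8))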

The partial normal matrices $M_i = A_i^T A_i$ are determined on each processor and then assembled using a recursive doubling method to give
\[
  M = \sum_{i=1}^{p} M_i.
\]
This approach assumes that the complexity of the model is such that the matrix $M$ can be held in the memory of one processor. We will see examples where further structure of this matrix can be exploited to distribute it onto several processors. This scheme is very effective for very large data sets and large models. The solution step can now be parallelised as well in order to deal with its $O(n^3)$ complexity.

It is often pointed out that the QR factorisation approach is more stable and will give better results than the normal equation approach. It turns out that this approach can also be performed in two steps:
(1) determine the QR factorisation $A = QR$ and $Q^T b$;
(2) solve $Rx = Q^T b$.
Again only the first step is data dependent. It can be parallelised by partitioning the data as in the case above. QR factorisations are then performed on each processor, giving $A_i = Q_i R_i$ and $Q_i^T b_i$, where $b_i$ denotes the corresponding block of $b$. The results of this, i.e., the $R_i$ and $Q_i^T b_i$, can now be exchanged and an overall QR factorisation is obtained, again using a recursive doubling method. As the amount of parallelism used by the recursive doubling is limited, one can in addition utilise some of the internal parallelism of the factorisations by applying the BLAS level 3 approaches discussed above.

So far we have considered the parallel solution of least squares problems. The algorithms considered here have another aspect, as the actual model will be selected from many candidates. In order to compare several candidates one has to solve many different least squares problems. Often the algorithms used take the form of greedy algorithms where a certain model is increased by some extra degrees of freedom and various ways to increase a given model are compared. Thus one solves many problems with residual equations
\[
  \tilde A x - y = r, \qquad \tilde A = \begin{pmatrix} A & B \end{pmatrix}
\]
for a given $A$ and many different $B$, with fixed $y$. The normal matrix for this case is
\[
  \begin{pmatrix} A^T A & A^T B \\ B^T A & B^T B \end{pmatrix}.
\]
The update requires a further scan of all the data to determine $A^T B$ and $B^T B$, which can be done in parallel as before. In addition one can do the updates for several possibilities $B$ in parallel. The Cholesky factorisation can also be updated efficiently, see, e.g., [28]. Note, however, that the granularity of the update is substantially less than the granularity of the full factorisation.

We have so far focused on the case of dense linear algebra. This is an important case and, in fact, many of the matrices occurring in predictive modelling are dense, e.g. when linear combinations of radial basis functions are fitted or for additive models.
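The QR variant of the same partition-and-merge pattern can be sketched in a few lines (the merge of the small triangular factors is done here in a single step rather than by recursive doubling; again an illustration, not the paper's code):

    import numpy as np

    def blockwise_qr_lsq(A_blocks, y_blocks):
        # Factor each block of rows independently, keep only R_i and Q_i^T b_i,
        # then merge the stacked triangular factors with one more QR.
        Rs, cs = [], []
        for A_i, y_i in zip(A_blocks, y_blocks):
            Q_i, R_i = np.linalg.qr(A_i)
            Rs.append(R_i)
            cs.append(Q_i.T @ y_i)
        Q, R = np.linalg.qr(np.vstack(Rs))       # merge step
        return np.linalg.solve(R, Q.T @ np.concatenate(cs))

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 5))
    y = A @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(1000)
    x = blockwise_qr_lsq(np.array_split(A, 4), np.array_split(y, 4))
    print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))   # True up to rounding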

However, for very large scale computations the complexity of dense matrix computations is not feasible and, in fact, we will see how such dense matrices can be avoided. So one has to ask the question whether the principles discussed above are still applicable in the sparse matrix case. It turns out that this is indeed the case.

The application of dense linear algebra to the direct solution of sparse least squares problems is well established in the multifrontal methods [1]. These methods make use of the fact that during the elimination procedure the original sparsity of the matrix tends to gradually disappear as more and more dense blocks appear. This is certainly the case for the simplest sparse matrices, the block bidiagonal matrices, which have the form
\[
  \begin{pmatrix}
    A_1 &     &     &        \\
    B_2 & A_2 &     &        \\
        & B_3 & A_3 &        \\
        &     & \ddots & \ddots
  \end{pmatrix}.
\]
The assembly stage is not fundamentally different from the dense case. In the factorisation the amount of parallelism seems now confined by the inherent dependency of the factorisation, which proceeds in a frontal fashion along the diagonal. Each stage of the factorisation will consider a new block row with $B_{i+1}$ and $A_{i+1}$ and will do a dense factorisation on these blocks. Thus the parallelism in this dense factorisation can be readily exploited. In addition, one can see that by an odd-even permutation (i.e., taking first all the odd numbered block rows and columns and then the even numbered ones) one gets a block system with an extra degree of parallelism. This approach is very similar to the odd-even reduction approach to summation. However, it turns out that the amount of floating point operations in this approach is higher than in the original factorisation. For more details on this, see [43, 3, 40].

In the case of more general matrices one can observe very similar effects. However, instead of having a single front and dense factorisations of a constant size, one now has multiple fronts which subsequently combine to form ever larger dense matrices. In addition to the dense parallelism one also has the parallelism across the various fronts, which is of the same nature as the parallelism for the block bidiagonal case. There is also the potential of trading some redundant computations for additional parallelism. See [1, 20] for more details. These multifrontal methods originated in the solution of finite element problems. For finite element problems one also has two stages, very similar to the two stages (forming of the normal equations and solution) of the least squares problems. It turns out that, as in the case of least squares problems, the assembly can be viewed as an elimination step for the augmented system of equations. The assembly and the factorisation of the matrix can be combined into one stage, as in the case of the QR factorisation.

2. Additive Models

2.1. The Backfitting Algorithm. Additive models are functions of the form
\[
  f(x_1,\dots,x_d) = f_0 + \sum_{j=1}^{d} f_j(x_j).
\]
For this subclass the curse of dimensionality is not present as only $O(d)$ storage space is required.

The constant component $f_0$ is obtained as an average of the values of the response variable:
\[
  f_0 = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}.
\]
For the estimation of the $f_j$, one-dimensional smoothers can be used. A smoothing parameter controls the tradeoff between bias and variance. The choice of the value of the smoothing parameter is essential and has been extensively discussed in the literature; see [35, 32] for an introduction and further pointers. Using a smoother $S_j$, the $f_j$ are defined as the solution of the equations
\[
  (1) \qquad f_j = S_j\Bigl(y - f_0 - \sum_{i \ne j} f_i \Bigm| x_j\Bigr)
\]
where $f_j$ is the vector of values of $f_j$ at the data points $x_j^{(i)}$, $i = 1,\dots,N$, i.e., $f_j = f_j(x_j)$. Uniqueness of the solution is obtained through orthogonality conditions on the $f_j$. This choice of the $f_j$ is motivated by the fact that the functions which minimise the expected squared residuals satisfy
\[
  f_j^e(x_j) = E\Bigl(Y - f_0^e - \sum_{i \ne j} f_i^e(X_i) \Bigm| X_j = x_j\Bigr)
\]
and the equations are obtained by replacing the expectation by the smoother. Alternatively, the fixpoint equations are obtained from a smoothing problem for additive models [34, 60].

The simple form of these equations, where in the $j$th equation the variable $x_j$ only occurs explicitly on the left hand side, suggests the use of a Gauss-Seidel or SOR process [46] for their solution. More specifically, for the Gauss-Seidel case a sequence of component functions $f_j^k$ is produced with
\[
  f_j^0 = S_j(y - f_0 \mid x_j).
\]
The iteration step is then
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k \Bigm| x_j\Bigr)
\]
where $f_j^k = f_j^k(x_j)$. This is the backfitting algorithm; see [34, 33, 35] for more details and also for a discussion of the local scoring algorithm which is used for generalised additive models.

In the case where the smoother $S_j$ is linear one introduces matrices $S_j$ (which depend on $x_j$) such that $f_j = f_j(x_j) = S_j(y \mid x_j)(x_j) = S_j y$. As a consequence of the fixpoint equations (1) one obtains a linear system of equations for the function value vectors $f_j$:
\[
  (2) \qquad
  \begin{pmatrix}
    I   & S_1 & \cdots & S_1 \\
    S_2 & I   & \cdots & S_2 \\
    \vdots &  & \ddots & \vdots \\
    S_d & S_d & \cdots & I
  \end{pmatrix}
  \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_d \end{pmatrix}
  =
  \begin{pmatrix} S_1(y - f_0) \\ S_2(y - f_0) \\ \vdots \\ S_d(y - f_0) \end{pmatrix}.
\]
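The Gauss-Seidel backfitting loop just described can be sketched as follows (NumPy, with a crude running-mean smoother standing in for the one-dimensional smoothers $S_j$; a simplification of mine, not the paper's code):

    import numpy as np

    def backfit(x, y, smoothers, n_iter=20):
        # Gauss-Seidel backfitting for f0 + sum_j f_j(x_j); smoothers[j] maps
        # (x_j, partial residual) to fitted values at the data points.
        N, d = x.shape
        f0 = y.mean()
        f = np.zeros((d, N))                     # f[j] holds f_j at the data points
        for _ in range(n_iter):
            for j in range(d):
                r = y - f0 - f.sum(axis=0) + f[j]
                f[j] = smoothers[j](x[:, j], r)
                f[j] -= f[j].mean()              # centre to keep the decomposition unique
        return f0, f

    def running_mean_smoother(xj, r, width=0.1):
        # purely illustrative local average
        return np.array([r[np.abs(xj - v) <= width].mean() for v in xj])

    rng = np.random.default_rng(1)
    x = rng.uniform(size=(500, 2))
    y = np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
    f0, f = backfit(x, y, [running_mean_smoother, running_mean_smoother])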

The backfitting procedure leads to an iterative characterisation of the $f_i^k$:
\[
  f_j^0 = S_j(y - f_0)
\]
and
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k\Bigr).
\]
This is formally a block Gauss-Seidel procedure for the component vectors and the established convergence theory for these methods can be applied. In particular, convergence is obtained if the matrices $S_j$ have eigenvalues in $(0,1)$. In fact, it turns out that in many practical applications backfitting converges in a couple of iterations. Note, however, that typically the block matrix is singular, so that standard convergence theorems do not apply directly but require a simple modification.

In general, the backfitting algorithm produces a sequence of functions $f_1^1,\dots,f_d^1, f_1^2,\dots,f_d^2, f_1^3,\dots$. Each element of the sequence depends on the $d$ immediately preceding elements. For the determination of one element, one smoother $S_j$ is evaluated. Thus the complexity of the determination of one element in this sequence is $O(N)$. For the full additive function one requires $d$ consecutive elements $f_j$. The complexity of the determination of $f_1^k,\dots,f_d^k$ is thus $O(kdN)$. Due to the dependency of the elements in the sequence one cannot determine multiple elements of the sequence simultaneously. A parallel implementation of the backfitting procedure would thus have to exploit any parallelism which is inherent in the smoother $S_j$ and in the evaluation of $y - f_0 - \sum_{i<j} f_i^{k+1} - \sum_{i>j} f_i^k$. While this is possible, we will in the following discuss a different approach which can be viewed as an alternative to the ordinary backfitting.

There are other alternatives to the parallel algorithm discussed here. One simple alternative is based on a Jacobi or fixpoint iteration. For this method the sequence of functions is defined by
\[
  f_j^{k+1} = S_j\Bigl(y - f_0 - \sum_{i \ne j} f_i^k \Bigm| x_j\Bigr).
\]
This substantially decreases the convergence rate but allows the simultaneous determination of the $d$ functions $f_j^{k+1}$. One can improve convergence by accelerating this procedure with a conjugate gradient method. There are, however, additional problems with this approach. First, the approach leads to a parallelism of at most degree $d$. But a speedup of $d$ may not be possible due to unbalanced loads, as it is unlikely that the time for all smoothers $S_j$ is the same. This is true in particular if one includes binary and categorical predictor variables. Load balancing would also typically be lost if one includes interaction terms or functions $f_{ij}(x_i, x_j)$, inclusions which do not require major modifications of the backfitting method but which may improve predictive performance. In the case of lower levels of parallelism $p$ and large $d$ one can get some load balancing using a master-worker approach where each processor can demand another component once it has completed earlier work. Such a dynamic load-balancing approach is well suited to a distributed computing environment. However, maybe the main problem with this approach of parallelising over the component functions $f_j$ is that each processor will require reading a large proportion of all the data. Thus the memory available limits the feasibility of this approach. An out-of-core method may be considered, but this can introduce substantial overheads as the data needs to be transferred between secondary storage and memory, since each iteration needs to visit all the data.

Thus for the data intensive iteration steps of backfitting one would prefer to have all the data points in memory. A difficulty of the parallel Jacobi method is also that at each step of the iteration data communication needs to be performed to move the new $f_i$ (or at least their sum) to all the processors. Further parallel iterative methods can be developed by applying known methods from linear algebra [51] to the block linear system of equations (2). An issue one has to deal with is the singularity of the block matrix, as most iterative methods are defined for nonsingular matrices.

2.2. A Parallel Backfitting Algorithm. A parallel backfitting algorithm is required for time-critical applications, when one deals with very large data sets and in the case of slow convergence. Here we discuss an algorithm which is based on data partitioning. The proposed method has applications outside of parallel processing as well, as it can be used for the case where the data is distributed over multiple repositories and for the case where not all the data fits into main memory. Data mining applications commonly do require processing of very large data sets, which may be distributed, will not fit in the memory, and have many variables. Thus data mining predictive modelling problems are applications which can ideally profit from the parallel algorithms discussed here.

The algorithm is based on a partition of the data into $p$ subsets. The elements of subset $r$ are denoted by $x_1^{[r]},\dots,x_d^{[r]}, y^{[r]}$ where $r = 1,\dots,p$. Using the same smoothers for each subset and the backfitting algorithm one obtains $p$ partial models $f_j^{[r]}$ for $j = 0,\dots,d$ and $r = 1,\dots,p$. The final model is obtained as a weighted sum
\[
  f_j = \sum_{r=1}^{p} w_r f_j^{[r]}
\]
for some weights $w_r$. The weights characterise the influence of each partial component on the overall model. They should satisfy $w_r \ge 0$ and $\sum_{r=1}^{p} w_r = 1$. They might include some measure of the reliability of each estimate; for example they could just be $N_r/N$ where $N_r$ is the number of data points in subset $r$. One can also iteratively refine this method, first fitting the data in this way, then the residuals, etc.

If subset $r$ requires $k_r$ iterations to find the partial model, the time for this subset is $O(d k_r N_r)$, say $d k_r N_r \tau_1$. If all the subsets are processed in parallel the total time for this step is then
\[
  T_1 = d \max_r k_r N_r \tau_1.
\]
The determination of the sum is a reduction operation. If we assume that the subsets are distributed one can do this reduction in $\log_2(p)$ steps by first adding the even-numbered elements to the preceding ones to get an array of models of size $p/2$ and then repeating this step until only one model is left.
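Before turning to the cost of the reduction step, here is a compact sketch of the whole partition-fit-combine scheme. It uses a deliberately simple binned additive fitter of my own (predictors assumed to be scaled to [0, 1)) so that the partial models are plain coefficient arrays which can be averaged; it is not the implementation discussed below.

    import numpy as np

    def fit_binned_additive(x, y, n_bins=20, n_iter=10):
        # Tiny additive-model fitter: each f_j is piecewise constant on n_bins
        # cells of [0, 1), fitted by backfitting; returns (f0, coefficient array).
        N, d = x.shape
        f0 = y.mean()
        coef = np.zeros((d, n_bins))
        bins = [np.minimum((x[:, j] * n_bins).astype(int), n_bins - 1) for j in range(d)]
        for _ in range(n_iter):
            for j in range(d):
                r = y - f0 - sum(coef[k][bins[k]] for k in range(d) if k != j)
                for b in range(n_bins):
                    m = bins[j] == b
                    coef[j, b] = r[m].mean() if m.any() else 0.0
        return f0, coef

    def parallel_fit(x, y, p, **kw):
        # Partition the data into p random subsets, fit a partial model on each
        # (independent fits -> one per processor), combine with weights N_r / N.
        N = len(y)
        parts = np.array_split(np.random.permutation(N), p)
        fits = [(len(i) / N, *fit_binned_additive(x[i], y[i], **kw)) for i in parts]
        f0 = sum(w * f0_r for w, f0_r, _ in fits)
        coef = sum(w * c_r for w, _, c_r in fits)
        return f0, coef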

As all these models are of the same size, the time to do this step is
\[
  T_2 = \log_2(p) \sum_{j=0}^{d} m_j \tau_2
\]
for some constant $\tau_2$, where $m_j$ is the complexity of the model $f_j$. Note that in the case of a shared memory one could make use of the higher parallelism by adding the components in parallel, and one would get in this case (if enough processors are available)
\[
  T_2 = \log_2(p) \max_j m_j \tau_2.
\]
The speedup obtained through this algorithm is
\[
  S_p = \frac{d k N \tau_1}{d \max_r k_r N_r \tau_1 + \log_2(p)\sum_{j=0}^{d} m_j \tau_2}.
\]
While the summing up part ($T_2$) is important for very large $p$ and smaller $N$, in the cases of interest in data mining the first component is often dominating. However, data communication costs are included in the constant $\tau_2$, and depending on the ratio of communication to computation time (basically $\tau_2/\tau_1$ in this case) the second component may be substantial. Where this is not the case, and if one partitions the data in a balanced way so that $N_r = N/p$ and introduces the parameter $\kappa_p = k/\max_r k_r$, then one gets a speedup of
\[
  S_p = \kappa_p\, p.
\]
So in order to determine the speedup one now needs to find $\kappa_p$. If $\kappa_p$ is bounded from below for fixed $N/p$ and growing $p$ then the algorithm is scalable. Clearly the value of $\kappa_p$ depends on the actual partitioning of the data. We defer this discussion a bit and first consider another question: how does the resulting additive model compare to the additive model obtained from the backfitting algorithm? Again this will depend on the way the data is partitioned but also on the weights $w_r$. In a very simple case it turns out that the iterates of the method presented above can also be generated by a smoothing procedure very similar to backfitting. This is the case of $p$ times repeated observations (the same $x$ values but arbitrary $y$ values for all partitions) which are distributed over the $p$ processors. We formulate this simple observation as a proposition (the index $j$ is always assumed to be $j = 1,\dots,d$):

Proposition 1. Let $\bar x_j = [x_j^{[1]},\dots,x_j^{[p]}]$ and $\bar y = [y^{[1]},\dots,y^{[p]}]$. Furthermore let $f_0^{[r]}$ and $f_j^{[r]}$ be the components of an additive model which have been fitted to the partial data $x_j^{[r]}$ and $y^{[r]}$ such that equations (1) hold for these data points. Now let $w_r \ge 0$ with $\sum_{r=1}^{p} w_r = 1$ and set $f_0 = \sum_{r=1}^{p} w_r f_0^{[r]}$ and $f_j = \sum_{r=1}^{p} w_r f_j^{[r]}$. If $x_j^{[r]} = x_j^*$ for all $r$, the smoothers for the partial components are linear with matrix $S_j$, and $\bar f_j = f_j(\bar x_j)$, then
\[
  \bar f_j = \bar S_j\Bigl(\bar y - f_0 - \sum_{i \ne j} \bar f_i\Bigr)
\]

with
\[
  \bar S_j = \begin{pmatrix} w_1 S_j & \cdots & w_p S_j \\ \vdots & & \vdots \\ w_1 S_j & \cdots & w_p S_j \end{pmatrix}.
\]

Proof. The vector $\bar f_j$ consists of $p$ identical blocks $f_j(x^*)$. We need to show that
\[
  f_j(x^*) = \sum_{r=1}^{p} w_r S_j\Bigl(y^{[r]} - f_0 - \sum_{i \ne j} f_i(x^*)\Bigr).
\]
By expanding the right hand side using $\sum_{r=1}^{p} w_r = 1$ and the definitions of $f_0$ and $f_j$ one gets
\begin{align*}
  \sum_{r=1}^{p} w_r S_j\Bigl(y^{[r]} - f_0 - \sum_{i\ne j} f_i(x^*)\Bigr)
  &= S_j\Bigl(\sum_{r=1}^{p} w_r y^{[r]} - f_0 - \sum_{i\ne j} f_i(x^*)\Bigr)\\
  &= S_j\Bigl(\sum_{r=1}^{p} w_r y^{[r]} - \sum_{r=1}^{p} w_r f_0^{[r]} - \sum_{i\ne j} \sum_{r=1}^{p} w_r f_i^{[r]}(x^*)\Bigr)\\
  &= S_j\Bigl(\sum_{r=1}^{p} w_r \bigl(y^{[r]} - f_0^{[r]} - \sum_{i\ne j} f_i^{[r]}(x^*)\bigr)\Bigr).
\end{align*}
Now one can use the property of equation (2) for the components to see that this is equal to
\[
  \sum_{r=1}^{p} w_r f_j^{[r]}(x^*) = f_j(x^*).
\]

So the parallel algorithm satisfies the equations (2) and thus the same methods for the analysis of convergence etc. can be applied to this method. The underlying one-dimensional smoother is quite intuitive. Many actual smoothers have the property that the smoother determined as the weighted average of the partial smoothers with $w_r = 1/p$ is exactly the same as the smoother applied to the full data set. In this case the sequential and the parallel methods will give the same results. One can also see that the eigenvalues of the matrices $\bar S_j$ and those of the partial matrices $S_j$ are the same in this case, and thus $k_r = k$ and $\kappa_p = 1$ up to small variations which can depend on the data vectors $y^{[r]}$. (The sequential backfitting turns out to be the backfitting for a problem with matrix $S_j$ and data $\sum_r w_r y^{[r]}$.)

In the more general case, where the data partitions are all of the same size and have been randomly sampled (without replacement) from the original data, the parallel algorithm is an instance of bagging without replacement; see [11, 23, 14] for further analysis. Basically, the parallel algorithm gives results very close to the results of the sequential algorithm. There might even be some advantages in choosing the parallel backfitting algorithm over the sequential one due to higher variance reduction, especially for nonlinear estimators. A parallel implementation of bagging has been suggested in [53], where it is also pointed out that the parallel algorithm of the kind discussed above may have superior statistical properties. We observe that the parallel backfitting procedure can be easily modified to provide a more general parallel bagging procedure.

Besides bagging and parallel additive models, other areas in data mining use the same idea of partitioning the data, applying the underlying method to all the parts and combining the results. An example where this is done in association rule mining can be found in [52].

2.3. Implementation of the Block and Merge Method. An implementation of a variant of the algorithm discussed above has been suggested and discussed in [39]. The method was tested on a SUN E4000 with 8 SPARC processors of 167 MHz each and 4.75 GBytes of shared memory. The application considered was the estimation of risk for a large car insurance database. The implementation used a master-worker approach and was written in C++, using the MPI message passing library for data exchange between processors. While this is not necessary for shared memory processors, the overhead of using MPI is small and it makes the code much more portable. Tests showed that this approach is scalable in the data size and in the number of processors, and achieves good speedups for the small numbers of processors considered. However, for larger numbers of processors and for distributed computing this approach has not been studied.

In this implementation the data is partitioned into m blocks, where m is typically larger than the number of processors available. The partitions are random and all of equal size. Each block contains N/m records, and the block size is chosen such that a block fits into memory, which provides a performance advantage for the backfitting algorithm as no swapping is required. It was found that choosing a small smoothing parameter for the one-dimensional smoothers has an advantage. This reduces the bias of the method at the cost of increased variance. Merging the models from the m partial data sets may involve averaging but may actually also include some smoothing. Additional smoothing may be done particularly after the last merge in order to control the tradeoff between bias and variance. The merging stages, however, were not very costly and consequently were done sequentially on one processor. The smoothing parameter in the initial stages of the algorithm is thus not only used to trade off bias and variance; in addition it controls the number of iterations and resolves numerical difficulties due to uneven data distribution. Only at the final combination step is the smoothing parameter used exclusively to balance bias and variance. For very large data sizes continuous data was discretised (binned) onto a regular grid. This allows effective data compression and leads to banded systems of equations if smoothing splines on equidistant grids are used. Note that binning can substantially reduce the data size and thus increase computational performance, but it may also lead to load imbalance when the data sizes on the different processors are reduced differently.

3. Thin Plate Splines

3.1. Fitting Radial Basis Functions. Radial basis function approximations are real functions $f(x)$ with $x \in \mathbb{R}^d$ of the form
\[
  (3) \qquad f(x) = \sum_{i=1}^{N} c_i\, \rho(\|x - x^{(i)}\|) + p(x),
\]
where $p$ is a polynomial, typically of low degree and even zero in many cases. Radial basis functions have recently attracted a lot of attention as they can efficiently approximate smooth functions of many variables.

Examples for the function $\rho$ include Gaussians $\rho(r) = \exp(-\alpha r^2)$, powers and thin plate splines $\rho(r) = r^\beta$ and $\rho(r) = r^\beta \ln(r)$ (the latter for even integers $\beta$ only), multiquadrics $\rho(r) = (r^2 + c^2)^{\beta/2}$ and others. Radial basis functions have been generalised to metric spaces, where the argument of $\rho$ is the distance $d(x, x^{(i)})$ (and the polynomial $p = 0$). As always, let $x^{(i)}$ be the data points of the predictor variables. The vector of coefficients of the radial basis functions $u = (u_1,\dots,u_N)$ and the vector of coefficients of the polynomial term $v = (v_1,\dots,v_m)$ are determined, in the case of smoothing, by a linear system of equations of the form
\[
  (4) \qquad
  \begin{pmatrix} A + \alpha I & P \\ P^T & 0 \end{pmatrix}
  \begin{pmatrix} u \\ v \end{pmatrix}
  =
  \begin{pmatrix} y \\ 0 \end{pmatrix}.
\]
The block $A$ has elements $\rho(\|x^{(i)} - x^{(j)}\|)$, the elements of the matrix $P$ are $p_j(x^{(i)})$ where the $p_j$ are the basis functions for the polynomials, the matrix $I$ is the identity and $\alpha$ is the smoothing parameter. The vector $y$ contains the observed values $y^{(i)}$ of the response variable. Existence, uniqueness and approximation properties, in particular for the case of interpolation ($\alpha = 0$), have been well studied. Reviews of radial basis function research can be found in [22, 47, 13].

The main computational task in the determination of a fit with radial basis functions is the solution of the system of equations (4). The dimension of the polynomial part is $m = O(d^s)$ for some small $s$ ($m = d+1$ in the case of linear polynomials) and $N$ is very large. Thus most elements of the matrix of the linear system of equations are in the block $A$. While the radial basis functions $\rho$ may be nonzero far from the origin, the effects of the combination of the functions cancel out, a consequence of the equation $P^T c = 0$. In spite of this locality the matrix $A$ is usually mostly dense. In high dimensions this may even be true when the radial basis function $\rho$ has a small support $\{x \mid \rho(x) \ne 0\}$. The reason for this is the concentration effect which is observed in high dimensions. One finds that the distribution of pairwise distances becomes very narrow, so that all randomly chosen pairs of points have close to the same distance. As the support of the radial basis function has to be chosen so that for every $x^{(i)}$ there are at least some $x^{(j)}$ with $\rho(\|x^{(i)} - x^{(j)}\|) \ne 0$, it follows from the concentration and the smoothness of $\rho$ that for any $i$ one gets $\rho(\|x^{(i)} - x^{(j)}\|) \ne 0$ for most other $j$. It follows that $A$ is fundamentally a dense matrix.

When solving the system (4) one can use either direct or iterative solvers. Due to the density of the matrix the direct solver will have complexity $O(N^3)$ and an iterative solver at least complexity $O(N^2)$, as it requires more than one matrix vector product. Efficient parallel algorithms for both cases are available and have been widely discussed in the literature; see, e.g., [9] for software and further references. The $O(N^2)$ or $O(N^3)$ complexity, however, implies non-scalability, and thus very large problems are not feasible for this approach. Similarly to the approach discussed in Section 2 one may use a bagging approach here as well and partition the data, fit a partial model to every part and average over the partial models. While this is a valid approach, we will now consider a different approach.
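For concreteness, here is a dense NumPy sketch of assembling and solving (4) for a two-dimensional thin plate spline (the kernel $r^2\log r$ and a linear polynomial part are assumptions of this sketch, and sign and scaling conventions vary in the literature). The dense solve is exactly the non-scalable $O(N^3)$ step discussed above:

    import numpy as np

    def tps_kernel(X1, X2):
        # thin plate kernel r^2 log r (value 0 at r = 0)
        r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r > 0, r**2 * np.log(r), 0.0)

    def tps_rbf_fit(X, y, alpha=1e-3):
        # Assemble and solve the saddle point system (4); O(N^2) storage, O(N^3) work.
        N = X.shape[0]
        A = tps_kernel(X, X)
        P = np.hstack([np.ones((N, 1)), X])           # polynomial basis 1, x1, x2
        K = np.block([[A + alpha * np.eye(N), P],
                      [P.T, np.zeros((3, 3))]])
        rhs = np.concatenate([y, np.zeros(3)])
        sol = np.linalg.solve(K, rhs)
        return sol[:N], sol[N:]                       # u (rbf coefficients), v (polynomial)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1]
    u, v = tps_rbf_fit(X, y, alpha=1e-4)
    yhat = tps_kernel(X, X) @ u + np.hstack([np.ones((200, 1)), X]) @ v   # fitted values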

3.2. Parallel and Scalable Finite Element Approximation of Thin Plate Splines. The radial basis function smoother can also be characterised variationally. In the case of thin plate splines [19] the radial basis functions are Green's functions of a partial differential equation and the radial basis smoother is the smoothing spline which minimises
\[
  (5) \qquad M_\alpha(f) = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x^{(i)}) - y_i\bigr)^2
  + \alpha \int_\Omega \Bigl(\frac{\partial^2 f(x)}{\partial x_1^2}\Bigr)^2
  + 2\Bigl(\frac{\partial^2 f(x)}{\partial x_1 \partial x_2}\Bigr)^2
  + \Bigl(\frac{\partial^2 f(x)}{\partial x_2^2}\Bigr)^2 \, dx
\]
in the case of $d = 2$. The intuition behind this is that the parameter $\alpha$ allows one to control the stiffness of the spline and thus the curvature of the fit. A small $\alpha$ will lead to a closer fit of the data, while a large $\alpha$ will lead to a smoother fit. The domain $\Omega$ is either $\mathbb{R}^2$ or a compact subset; for our computations we will consider the case of a compact domain $\Omega$ rather than $\Omega = \mathbb{R}^2$.

Instead of using the radial basis functions (for the case $\Omega = \mathbb{R}^2$) and equation (4) with linear polynomials, one can develop a good approximation to this problem using numerical methods for the solution of elliptic partial differential equations, in particular finite element methods. This approach transforms the smoothing problem into a penalised finite element least squares fit. Finite element approximations have well established approximation properties, e.g., [17, 10]. The main results indicate that generic finite element approximations lead to good approximations for smooth enough functions and regular domains. We will only discuss the computational aspects further here.

Fundamentally, the finite element method approximates the function $f$ with a function from a finite dimensional linear space, in particular a space of piecewise polynomial functions. This approximation is written as a linear combination of basis functions which typically have a compact (small) support (the set of $x$ where the basis function is nonzero). Thus we get a representation
\[
  f(x) = \sum_{j=1}^{m} u_j B_j(x)
\]
with basis functions $B_j$ and coefficients $u_j$. The number of degrees of freedom of the approximating space is denoted by $m$ and is typically much smaller than $N$. We are now looking for a function $f$ of this form which minimises the functional $M_\alpha$ in (5). The vector of coefficients $u = [u_1,\dots,u_m]$ then minimises the quadratic function
\[
  u^T M u - 2 b^T u + \frac{1}{N} y^T y + \alpha u^T S u
\]
for some matrices $M$ and $S$ and a vector $b$, the determination of which will be discussed later. The minimum of this function satisfies the equations
\[
  (M + \alpha S) u = b.
\]
Thus the fitting problem has now been reduced to two problems, the first one being the determination of $M$, $S$ and $b$ and the second one the solution of this system. We will now discuss both steps and see how parallel processing can substantially improve performance but, maybe most importantly, how the scalability problem of the original radial basis function approach has suddenly disappeared! These two steps can be found in any finite element solver. The first step is called the assembly step and the second the solution step.
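The two steps can be seen in miniature in the following one-dimensional analogue (a simplification of mine, not the method developed here: hat functions on a uniform grid over [0, 1] and a simple second-difference penalty standing in for the stiffness matrix $S$). The assembly scans the data once, and the solve involves only the small $m \times m$ system:

    import numpy as np

    def penalised_fe_fit_1d(x, y, m, alpha):
        # Assemble M and b in one pass over the data, then solve (M + alpha S) u = b.
        N = len(y)
        h = 1.0 / (m - 1)
        M = np.zeros((m, m))
        b = np.zeros(m)
        for xi, yi in zip(x, y):                  # assembly: O(N), one data scan
            k = min(int(xi / h), m - 2)           # grid interval containing xi
            t = xi / h - k
            B = np.array([1 - t, t])              # the two nonzero hat functions at xi
            M[k:k + 2, k:k + 2] += np.outer(B, B) / N
            b[k:k + 2] += yi * B / N
        D = np.diff(np.eye(m), n=2, axis=0)       # second differences as a crude penalty
        S = D.T @ D / h**3
        return np.linalg.solve(M + alpha * S, b)  # m x m system, independent of N

    rng = np.random.default_rng(0)
    x = rng.uniform(size=2000)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(2000)
    u = penalised_fe_fit_1d(x, y, m=50, alpha=1e-6)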

From the variational problem one sees that a general element of the mass matrix $M$ is
\[
[M]_{jk} = \frac{1}{N} \sum_{i=1}^{N} B_j(x^{(i)})\, B_k(x^{(i)}),
\]
the right-hand side $b$ has the components
\[
b_k = \frac{1}{N} \sum_{i=1}^{N} y^{(i)} B_k(x^{(i)})
\]
and the elements of the stiffness matrix $S$ are the integrals
\[
[S]_{kj} = \int_\Omega \left[ \frac{\partial^2 B_k(x)}{\partial x_1^2}\frac{\partial^2 B_j(x)}{\partial x_1^2}
+ 2\,\frac{\partial^2 B_k(x)}{\partial x_1 \partial x_2}\frac{\partial^2 B_j(x)}{\partial x_1 \partial x_2}
+ \frac{\partial^2 B_k(x)}{\partial x_2^2}\frac{\partial^2 B_j(x)}{\partial x_2^2} \right] dx.
\]

The complexity of the assembly step depends very much on the support of the basis functions $B_j$. We will assume here that the support is such that for every point $x$ at most $K$ basis functions have $B_j(x) \neq 0$. Thus the complexity of the determination of $b$ is $O(KN)$ and one can see that the complexity of the determination of $M$ is $O(K^2 N)$, as for every point $x$ there are at most $K^2$ products $B_j(x) B_k(x) \neq 0$. The complexity of the determination of $S$ is of the order $O(K^2 m)$ in nondegenerate cases. For large data sets the time to generate the matrix will dominate. Note that this step is scalable in the data size. Moreover, it only requires one scan of the full data set. By partitioning the data into $p$ equal parts this step can also be ideally parallelised up to a reduction step. The complexity for this stage is thus approximately
\[
T_a = \frac{K^2 N \tau_1}{p} + \log_2(p)\,\nu(M)\,\tau_2
\]
where $\nu(M)$ is the number of nonzero elements of $M$. Typically the first part is dominating and a speedup of close to $p$ will be obtained.

The complexity of the solution step depends on the method used. Both direct methods using, e.g., parallel multifrontal algorithms [1] and iterative methods using, e.g., parallel multigrid algorithms [58] are applicable. The speedups obtained for the solution step are not quite as good as for the assembly. This mostly has to do with increased communication requirements, but also with the fact that (in particular in the case of direct solvers) parallel algorithms may require extra operations compared with the corresponding sequential algorithm. This extra work required for parallel execution is sometimes called redundant work.

A Parallel Nonconforming Finite Element Method.

The finite element method delivered a parallel algorithm which is scalable in the data size and thus capable of handling very large fitting problems. However, in our discussion we have glossed over one important point, namely the effect of the second derivatives in the penalty term. They lead to fourth order partial derivatives in the Euler equations and can make the linear systems of equations ill-conditioned and hard to solve, which, e.g., leads to slow convergence of the iterative solvers. But maybe computationally even more of a problem is the fact that this requires the basis functions to be much smoother, which in turn leads to a large $K$ and matrices $M$ and $S$ which are more densely populated. The higher smoothness also requires higher order polynomials which are more expensive to evaluate. Of course many of these problems are not unique to the fitting problem; in fact, they are known to occur for general finite element problems for plates. This contrasts very much with the situation where the penalty is the norm of the gradient of $f$, in which case the Euler equations involve the Laplacian and one can use much simpler, low order elements.
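Returning to the data-parallel assembly step described above, the partition-and-reduce structure can be sketched as follows. The sketch is not from the paper: it again uses the one-dimensional hat functions as a stand-in for the actual elements, and the names such as `assemble_chunk` are hypothetical. Each of the $p$ data parts is processed independently and the partial matrices and vectors are summed in a reduction, matching the cost model $T_a$ above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

GRID = np.linspace(0.0, 1.0, 51)      # regular 1-d grid carrying the hat functions

def assemble_chunk(chunk):
    """Local contributions to M and b from one data partition.
    Every call is independent of the others, so the p calls can be
    executed on p processors and combined afterwards."""
    Xc, yc = chunk
    m = len(GRID)
    h = GRID[1] - GRID[0]
    M_part = np.zeros((m, m))
    b_part = np.zeros(m)
    for x, yi in zip(Xc, yc):
        cell = min(int((x - GRID[0]) / h), m - 2)       # grid interval containing x
        t = (x - GRID[cell]) / h
        idx, vals = (cell, cell + 1), (1.0 - t, t)      # the K = 2 nonzero hats at x
        for j, Bj in zip(idx, vals):
            b_part[j] += yi * Bj
            for k, Bk in zip(idx, vals):
                M_part[j, k] += Bj * Bk
    return M_part, b_part

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    N, p = 100_000, 4
    X = rng.random(N)
    y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(N)

    # Partition the data into p equal parts and assemble each part independently.
    chunks = list(zip(np.array_split(X, p), np.array_split(y, p)))
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(assemble_chunk, chunks))

    # Reduction step: sum the p partial results; a tree reduction needs about
    # log2(p) communication rounds of size nu(M).
    M = sum(M_part for M_part, _ in partials) / N
    b = sum(b_part for _, b_part in partials) / N
```

The subsequent solve of $(M + \alpha S)u = b$ then acts on an $m \times m$ system whose size is independent of $N$, so the overall method remains scalable in the data size.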
