SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA
Bita Darvish Rouhani, Ebrahim M. Songhori, Azalia Mirhoseini, and Farinaz Koushanfar
Electrical and Computer Engineering Department, Rice University, Houston, Texas, USA

Abstract—This paper proposes SSketch, a novel automated computing framework for FPGA-based online analysis of big data with dense (non-sparse) correlation matrices. SSketch targets streaming applications where each data sample can be processed only once and storage is severely limited. The stream of input data is used by SSketch for adaptive learning and updating a corresponding ensemble of lower dimensional data structures, a.k.a., a sketch matrix. A new sketching methodology is introduced that tailors the problem of transforming the big data with dense correlations to an ensemble of lower dimensional subspaces such that it is suitable for hardware-based acceleration performed by reconfigurable hardware. The new method is scalable, while it significantly reduces costly memory interactions and enhances matrix computation performance by leveraging coarse-grained parallelism existing in the dataset. To facilitate automation, SSketch takes advantage of a HW/SW co-design approach: it provides an Application Programming Interface (API) that can be customized for rapid prototyping of an arbitrary matrix-based data analysis algorithm. Proof-of-concept evaluations on a variety of visual datasets with more than 11 million non-zeros demonstrate up to a 2-fold speedup on our hardware-accelerated realization of SSketch compared to a software-based deployment on a general purpose processor.

Keywords—Streaming model; Big data; Dense matrix; FPGA; Low-rank matrix; HW/SW co-design; Matrix sketching

I. INTRODUCTION

Computers and sensors continually generate data at an unprecedented rate.
Collections of massive data are often represented as large m × n matrices, where n is the number of samples and m is the corresponding number of features. The growth of big data is challenging traditional matrix analysis methods like Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). Both SVD and PCA incur a large memory footprint and an O(m²n) computational complexity, which limits their practicality in the big data regime. This disruption of convention changes the way we analyze modern datasets and makes designing scalable factorization methods, a.k.a., sketching algorithms, a necessity. With a properly designed sketch matrix, the intended computations can be performed on an ensemble of lower dimensional structures rather than the original matrix without a significant loss. In the big data regime, there are at least two sets of challenges that should be addressed simultaneously to optimize performance. The first challenge class is to minimize the resource requirements for obtaining the data sketch within an error threshold in a timely manner. This favors sketching methods whose computational complexity scales readily to large amounts of data. The second challenge class has to do with mapping the computation to increasingly heterogeneous modern architectures/accelerators. The cost of computing on these architectures is dominated by message passing for moving data to/from memory and between cores. What exacerbates the cost is the iterative nature of dense matrix calculations, which requires multiple rounds of message passing. To optimize the cost, communication between the processors and memory hierarchy levels must be minimized. Therefore, the trade-off between memory communication and redundant local calculations should be carefully leveraged to improve the performance of iterative computations.
Several big data analysis applications require real-time and online processing of streaming data, where storing the original big matrix becomes superfluous and the data is read at most once [1]. In these scenarios, the sketch has to be found online. A large body of earlier work demonstrated the efficiency of using custom hardware to accelerate traditional matrix sketching algorithms such as QR [2], LU [3], Cholesky [4], and SVD [5]. However, the existing hardware-accelerated sketching methods either have a higher-than-linear complexity [6], or are non-adaptive for online sketching [5]. They are thus unsuitable for streaming applications and big data analysis with dense correlation matrices. A recent theoretical solution for scalable sketching of big data matrices is presented in [1], which also relies on running SVD on the sketch matrix. Even this method does not scale to larger sketch sizes. To the best of our knowledge, no hardware acceleration for streaming sketches has been reported in the literature. Despite the apparent density and dimensionality of big data, in many settings the information may lie on a single or a union of lower dimensional subspaces [7]. We propose SSketch, a novel automated framework for efficient analysis and hardware-based acceleration of massive and densely correlated datasets in streaming applications. It leverages the hybrid structure of data as an ensemble of lower dimensional subspaces. The SSketch algorithm is a scalable approach for dynamic sketching of massive datasets that works by factorizing the original (densely correlated) large matrix into two new matrices: (i) a dense but much smaller dictionary matrix which includes a set of atoms learned from the input data, and (ii) a large block-sparse matrix where the blocks are organized such that the subsequent matrix computations incur a minimal amount of message passing on the target platform.
As the stream of input data arrives, SSketch adaptively learns from the incoming vectors and updates the sketch of the collection. An accompanying API is also provided by our work so designers can utilize the scalable SSketch framework for rapid prototyping of an arbitrary matrix-based data analysis
algorithm. SSketch and its API target a widely used class of data analysis algorithms that model the data dependencies by iteratively updating a set of matrix parameters, including but not limited to most regression methods, belief propagation, expectation maximization, and stochastic optimization [8]. Our framework addresses the big data learning problem by using SSketch's block-sparse matrix and applying an efficient, greedy routine called Orthogonal Matching Pursuit (OMP) to each sample independently. Note that OMP is a key computational kernel that dominates the performance of many sparse reconstruction algorithms. Given the wide range of applications, it is thus not surprising that a large number of OMP implementations on GPUs [9], ASICs [1], and FPGAs [11] have been reported in the literature. However, the prior work on FPGA has focused on fixed-point number formats. In addition, most earlier hardware-accelerated OMP designs are restricted to fixed and small matrix sizes and can handle only a small number of OMP iterations [21], [11], [12]. Such designs cannot be readily applied to massive, dynamic datasets. In contrast, SSketch's scalable methodology introduces a novel approach that enables applying OMP to matrices of varying large sizes and supports an arbitrary number of iterations. SSketch utilizes the abundant hardware resources on current FPGAs to provide a scalable, floating-point implementation of OMP for sketching purposes. One may speculate that GPUs would show better acceleration performance than FPGAs. However, the performance of GPU accelerators is limited in our application for two main reasons. First, for streaming applications, the memory hierarchy in GPUs increases the communication overhead and thus reduces the throughput of the whole system. Second, in the SSketch framework the number of operations required to compute the sketch of each individual sample depends on the input data structure and may vary from one sample to the next.
Thus, the GPU's applicability is reduced due to its Single Instruction Multiple Data (SIMD) architecture. The explicit contributions of this paper are as follows:
- We propose SSketch, a novel communication-minimizing framework for online (streaming) large matrix computation. SSketch adaptively learns the hybrid structure of the input data as an ensemble of lower dimensional subspaces and efficiently forms the sketch matrix of the ensemble.
- We develop a novel streaming factorization method for FPGA acceleration. Our sketching algorithm benefits from a fixed, low memory footprint and an O(mn) computational complexity.
- We design an API to facilitate automation and adaptation of SSketch's scalable and online matrix sketching method for rapid prototyping of an arbitrary matrix-based data analysis algorithm.
- We devise SSketch with a scalable, floating-point implementation of the Orthogonal Matching Pursuit (OMP) algorithm on FPGA.
- We evaluate our framework with three different massive datasets. Our evaluations corroborate SSketch's scalability and practicality. We compare the SSketch runtime to a software realization on CPU and also report its overhead.

II. BACKGROUND AND PRELIMINARIES

A. Streaming Model

Modern data matrices are often extremely large, which requires distributing the computation beyond a single core. Some prominent examples of such massive datasets include image acquisition, medical signals, recommendation systems, wireless communication, and internet users' activities [13]. The dimensionality of modern data collections renders traditional algorithms infeasible. Therefore, matrix decomposition algorithms should be designed to be scalable and pass-efficient. In pass-efficient algorithms, data is read at most a constant number of times. A streaming algorithm is a pass-efficient computational model that requires only one pass through the dataset.
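The one-pass constraint can be illustrated with a toy example (not part of SSketch itself): the stream is consumed column by column under a fixed memory budget, and the full matrix is never materialized. The function name and the accumulated statistics are illustrative only.

```python
import numpy as np

def one_pass_stats(column_stream):
    """Accumulate statistics over a stream of column vectors in a single
    pass: each column is read exactly once and then discarded, so the
    memory footprint stays fixed no matter how many columns arrive."""
    n = 0
    sq_sum = 0.0                      # running sum of squares -> ||A||_F^2
    for col in column_stream:
        n += 1
        sq_sum += float(np.dot(col, col))
        # `col` goes out of scope here; the original matrix is never stored
    return n, np.sqrt(sq_sum)

# Two-column toy stream standing in for an unbounded data source.
stream = iter([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
n_cols, frob = one_pass_stats(stream)   # frob = sqrt(1 + 4 + 9 + 16)
```

The same pattern, consuming each arriving column exactly once, is what the streaming model above requires of any sketching routine.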
By taking advantage of a streaming model, the sketch of a collection can be obtained online, which makes storing the original matrix superfluous [14].

B. Orthogonal Matching Pursuit

OMP is a well-known greedy algorithm for solving sparse approximation problems. It is a key computational kernel in many compressed sensing algorithms. OMP has wide applications ranging from classification to structural health monitoring. As described in Alg. 2, OMP takes a dictionary and a signal as inputs and iteratively approximates the sparse representation of the signal by adding the best fitting element in every iteration. More details regarding the OMP algorithm are presented in Section V.

C. Notation

We write vectors in bold lowercase script, x, and matrices in bold uppercase script, A. Aᵗ denotes the transpose of A. A_j represents the j-th column and A_ij the entry at the i-th row and j-th column of matrix A. A_λ is the subset of the columns of A indexed by the set λ. nnz(A) denotes the number of non-zeros in the matrix A. The p-norm of a vector, for p ≥ 1, is ‖x‖_p = (Σ_{j=1}^n |x(j)|^p)^{1/p}. The Frobenius norm of matrix A is ‖A‖_F = (Σ_{i,j} A(i,j)²)^{1/2}, and the spectral norm is ‖A‖_2 = max_x ‖Ax‖_2 / ‖x‖_2. The matrix A (domain data) is of size m × n, where m ≪ n for over-complete matrices.

III. RELATED WORK

In settings where the column span of A admits a lower dimensional embedding, the optimal low-rank approximation of the data is obtained by the SVD or PCA algorithm [15]. However, their O(m²n) complexity makes it hard to utilize these well-known algorithms for massive datasets. Unlike PCA, Sparse PCA (SPCA) is modified to find principal components with sparse loadings, which is desirable for interpreting data and reducing storage [16]. However, the computational complexity of SPCA is similar to that of classic SVD. Thus, it is not scalable for analyzing massive, dense datasets [16], [17].
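As a quick numerical illustration of the notation in Section II-C, the norms and the nnz(·) count can be reproduced with a few lines of NumPy; the small matrix and vector below are arbitrary examples, not data from the paper.

```python
import numpy as np

# Arbitrary small examples to illustrate the notation of Section II-C.
A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
x = np.array([3.0, 4.0])

p = 2
p_norm = np.sum(np.abs(x) ** p) ** (1.0 / p)  # ||x||_p = (sum |x(j)|^p)^(1/p) = 5
frob   = np.sqrt(np.sum(A ** 2))              # ||A||_F = (sum A(i,j)^2)^(1/2) = 5
spec   = np.linalg.norm(A, 2)                 # ||A||_2 = max_x ||Ax||_2 / ||x||_2
nnz    = np.count_nonzero(A)                  # nnz(A) = 2
```

For this rank-1 example the Frobenius and spectral norms coincide; for general matrices ‖A‖_2 ≤ ‖A‖_F.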
Recently, the efficiency of random subsampling methods to compute the lower dimensional embedding of large datasets has been shown [18], [3]. Random Column Subset Selection (rcss) approach has been used to provide scalable strategies
for factorizing large matrices [18]. The authors of [18] propose a theoretically scalable approach for large matrix factorization, but hardware constraints are not considered in their work. The large memory footprint and non-adaptive structure of their rcss approach make it unsuitable for streaming applications. A recent approach in [7] suggests a distributed framework based upon a scalable and sparsity-inducing data transformation algorithm. The framework enables efficient execution of large-scale iterative learning algorithms on massive and dense datasets. The SSketch framework is developed based upon a novel extension of the sketching method proposed in [7]. Unlike the work in [7], our data sketching approach is well-suited for streaming applications and is amenable to FPGA acceleration. OMP has been shown to be very effective in inducing sparsity, although its complexity makes it costly for streaming applications. A number of implementations on GPUs [9], [19], [2], ASICs [1], and FPGAs [21], [11], [22] have been reported in the literature to speed up this complex reconstruction algorithm. FPGA implementations of OMP have been developed for fixed problem dimensions, targeting signals with sparsity levels of 5 and 36, respectively [21], [11]. To the best of our knowledge, none of the previous implementations of OMP is devised for streaming applications with large and densely correlated data matrices. In addition, the use of fixed-point format to compute and store the results limits their applicability for sketching purposes.

IV. SSKETCH GLOBAL FLOW

The global flow of SSketch is presented in Fig. 1. SSketch takes the stream of a massive, dynamic dataset in matrix form as its input and computes/updates a sketch matrix of the collection as its output. Our sketch formation algorithm is devised to minimize costly message passing to/from memory and between cores, thereby reducing communication delay and energy.
All of SSketch's computations are done in IEEE 754 single precision floating-point format. The SSketch framework is developed based upon a novel sketching algorithm that we introduce in Section V. As illustrated in Fig. 1, SSketch consists of two main components to compute the sketch of dynamic data collections: (i) a dictionary learning unit, and (ii) a data sketching unit. As the stream of data comes in, the first component adaptively learns a dictionary as a subsample of the input data such that the hybrid structure of the data is well captured within the learned dictionary. Next, the data sketching unit solves a sparse approximation problem using the OMP algorithm to compute a block-sparse matrix. In the data sketching unit, the representation of each newly arriving sample is computed based on the current values of the dictionary, and the result is sent back to the host. SSketch's accompanying API facilitates automation and adaptation of our proposed framework for rapid prototyping of an arbitrary matrix-based data analysis algorithm.

V. SSKETCH METHODOLOGY

Many modern massive datasets are either low-rank or lie on a union of lower dimensional subspaces. This convenient property can be leveraged to efficiently map the data to an ensemble of lower dimensional data structures [7]. The authors in [7] suggest a distributed framework based upon a scalable and sparsity-inducing solution to find the sketch of large and dense datasets such that:

minimize_{D ∈ R^{m×l}, V ∈ R^{l×n}} ‖A − DV‖_F subject to ‖V‖_0 ≤ kn,   (1)

where A ∈ R^{m×n} is the input data, D ∈ R^{m×l} is the dictionary matrix, V ∈ R^{l×n} is the block-sparse matrix, and l ≪ m ≪ n. Here ‖V‖_0 measures the total number of non-zeros in V, and k is the target sparsity level for each input sample. Their approach, however, is static and does not adaptively update the dictionary at runtime. The only way to update it is to redo the dictionary computation, which would incur a higher cost and is unsuitable for streaming applications with a single-pass requirement and limited memory.
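The two quantities in problem (1), the Frobenius approximation error and the sparsity budget ‖V‖_0 ≤ kn, can be checked for any candidate factorization with a few lines of NumPy. `sketch_objective` is an illustrative helper written for this document, not part of the SSketch API.

```python
import numpy as np

def sketch_objective(A, D, V, k):
    """Evaluate problem (1) for a candidate factorization A ~ D V:
    return the Frobenius error ||A - D V||_F and whether the total
    sparsity budget nnz(V) <= k * n is respected."""
    n = A.shape[1]
    err = np.linalg.norm(A - D @ V, "fro")
    feasible = np.count_nonzero(V) <= k * n
    return err, feasible

# Toy exact factorization: the error is 0 and the k = 2 budget holds.
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
V = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])
A = D @ V
err, feasible = sketch_objective(A, D, V, k=2)
```

In practice the error is nonzero and is traded off against nnz(V), which is exactly the tension that the thresholds α and ε control in the algorithm below.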
We develop SSketch based upon a novel extension of [7] for streaming applications. SSketch tailors the solution of (1) according to the underlying platform's constraints. Our approach incurs a lower memory footprint and is well-suited for scenarios where storage is severely limited.

A. SSketch Algorithm

Our platform-aware matrix sketching algorithm is summarized in Alg. 1. The SSketch algorithm approximates matrix A as a product of two other matrices (A_{m×n} ≈ D_{m×l} V_{l×n}) based on a streaming model.

Algorithm 1 SSketch algorithm
Inputs: Measurement matrix A, projection threshold α, sparsity level k, error threshold ε, and dictionary size l.
Output: Dictionary matrix D and coefficient matrix V.
1: D ← empty
2: j ← 0
3: for i = 1, ..., n do
4:   W(A_i) ← ‖D(DᵗD)⁻¹DᵗA_i − A_i‖_2 / ‖A_i‖_2
5:   if W(A_i) > α and j < l then
6:     D_j ← A_i / ‖A_i‖_2
7:     V_ij ← ‖A_i‖_2
8:     j ← j + 1
9:   else
10:    V_i ← OMP(D, A_i, k, ε)
11:  end if
12: end for

For each newly arriving sample, our algorithm first calculates a projection error, W(A_i), based on the current values of the dictionary matrix D. Next, it compares the calculated error with a user-defined projection threshold α and updates the sketch accordingly. SSketch locally updates the dictionary matrix D based on each arriving data sample and makes use of the greedy OMP routine to compute the block-sparse matrix V. OMP can be used either by fixing the number of non-zeros in each column of V (sparsity level k) or by fixing the total amount of approximation error (error threshold ε) for each sample. Factorizing the input matrix A as a product of two matrices with far fewer non-zeros than the original data induces an approximation error that can be controlled by tuning the error threshold (ε), dictionary size (l), and projection
threshold (α) in the SSketch framework. In our experiments, we consider the Frobenius norm error (Xerr = ‖A − DV‖_F / ‖A‖_F) as the sketch accuracy metric. As can be seen, the SSketch algorithm requires only one pass through each arriving sample. The method only requires storing a single column of the input matrix A and the matrix D at a time. Note that the dictionary matrix D_{m×l} is constructed from columns of the data matrix A_{m×n}. The column space of D is contained in the column space of A; thus, rank(DD⁺A) = rank(D) ≤ l ≪ m. This implies that for over-complete datasets the OMP computation is required for n − l columns, and the overhead time of copying D is negligible due to its small size compared to A.

Fig. 1: High level block diagram of SSketch. It takes a stream of data as input and adaptively learns a corresponding low-rank sketch of the collection by doing computation at the level of the matrix rank. The resulting sketch is then sent back to the host for further analysis depending on the application.

OMP with QR Decomposition. As we describe in Section VII, the computational complexity of the projection step (line 4 of Alg. 1) is small compared to the O(mnl²) complexity of the OMP algorithm. Thus, the computational bottleneck of the SSketch algorithm is OMP. To boost the computational performance of SSketch for analyzing a large amount of data on FPGA, it is necessary to modify the OMP algorithm such that it maximally benefits from the available resources and incurs a scalable computational complexity. Alg. 2 shows the pseudocode of OMP, where ε is the error threshold and k is the target sparsity level. The Least Squares (LS) minimization step (line 5 of Alg. 2) involves a variety of operations with complex data flows that introduce extra hardware complexity. However, proper use of factorization techniques like QR decomposition or the Cholesky method within the OMP algorithm reduces its hardware implementation complexity and makes it well-suited for hardware accelerators [11], [23].
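In software, Alg. 1 can be sketched in a few dozen lines. This is a simplified rendering under stated assumptions: the least-squares step of OMP calls `np.linalg.lstsq` instead of the paper's incremental QR update, the dictionary projection uses a pseudo-inverse, and an absolute residual threshold stands in for ε. All function names are illustrative, not the paper's implementation.

```python
import numpy as np

def omp(D, a, k, eps):
    """Greedy OMP (Alg. 2), simplified: the LS step calls lstsq rather
    than updating an incremental QR factorization as in Alg. 3."""
    r = a.astype(float).copy()
    support = []
    v = np.zeros(D.shape[1])
    for _ in range(k):
        if np.linalg.norm(r) <= eps:          # error threshold reached
            break
        j = int(np.argmax(np.abs(D.T @ r)))   # best fitting column
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], a, rcond=None)
        r = a - D[:, support] @ coef          # residual update
        v[support] = coef
    return v

def ssketch(columns, alpha, k, eps, l):
    """Streaming sketch (Alg. 1), assuming alpha < 1: a poorly
    represented sample joins the dictionary while D has fewer than l
    columns; otherwise it is encoded against the current D with OMP."""
    D_cols, V_cols = [], []
    for a in columns:
        nrm = np.linalg.norm(a)
        if D_cols:
            D = np.column_stack(D_cols)
            proj = D @ (np.linalg.pinv(D) @ a)  # D (D^t D)^+ D^t A_i
            w = np.linalg.norm(proj - a) / nrm  # projection error W(A_i)
        else:
            w = 1.0                             # empty D represents nothing
        v = np.zeros(l)
        if w > alpha and len(D_cols) < l:
            v[len(D_cols)] = nrm                # V_ij = ||A_i||_2
            D_cols.append(a / nrm)              # D_j = A_i / ||A_i||_2
        else:
            D = np.column_stack(D_cols)
            v[:D.shape[1]] = omp(D, a, k, eps)
        V_cols.append(v)
    return np.column_stack(D_cols), np.column_stack(V_cols)[:len(D_cols), :]

# Three streaming samples; the third lies in the span of the first two.
cols = [np.array([2.0, 0.0, 0.0]),
        np.array([0.0, 3.0, 0.0]),
        np.array([1.0, 1.0, 0.0])]
D_hat, V_hat = ssketch(cols, alpha=0.1, k=2, eps=1e-8, l=4)
```

On this toy stream the first two samples are admitted into the dictionary and the third is sparse-coded against them, so D_hat @ V_hat reconstructs the input exactly.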
Thus, to efficiently solve the LS optimization problem in line 5 of Alg. 2, we use QR decomposition (Alg. 3). QR decomposition returns an orthogonal matrix Q and an upper-triangular matrix R; the decomposition is updated iteratively by reusing the Q and R matrices from the last OMP iteration. In this approach, the residual (line 6 of Alg. 2) can be updated by r_i ← r_{i−1} − Q_i(Q_i)ᵗ r_{i−1}. The final solution is calculated by performing back substitution to solve the inversion of the matrix R in v_k = R⁻¹QᵗA_i. Assuming that matrix A is of size m × n and D is of size m × l, the complexity of OMPQR is O(mnl²). This complexity is linear in m and n, since in many settings l is much smaller than both.

Algorithm 2 OMP algorithm
Inputs: Matrix D, measurement A_i, sparsity level k, error threshold ε.
Output: Support set Λ and k-dimensional coefficient vector v.
1: r_0 ← A_i
2: Λ ← ∅
3: for i = 1, ..., k do
4:   Λ ← Λ ∪ argmax_j |⟨r_{i−1}, D_j⟩|   ▷ Find best fitting column
5:   v_i ← argmin_v ‖r_{i−1} − D_Λ v‖₂²   ▷ LS optimization
6:   r_i ← r_{i−1} − D_Λ v_i   ▷ Residual update
7: end for

Algorithm 3 Incremental QR decomposition by modified Gram-Schmidt
Inputs: New column D_{Λ_s}, last iteration's Q^{s−1} and R^{s−1}.
Output: Q^s and R^s.
1: R^s ← [R^{s−1}, 0; 0, 0]
2: ξ^s ← D_{Λ_s}
3: for j = 1, ..., s−1 do
4:   R^s_{js} ← (Q^{s−1}_j)ᵗ ξ^s
5:   ξ^s ← ξ^s − R^s_{js} Q_j
6: end for
7: R^s_{ss} ← ‖ξ^s‖_2
8: Q^s ← [Q^{s−1}, ξ^s / R^s_{ss}]

This linear complexity enables SSketch to readily scale up for processing a large amount of data based on a streaming model.

B. Blocking SSketch

Let A = [A_1; A_2; A_3] be a matrix consisting of row blocks A_1, A_2, and A_3 stacked on top of one another. Our key observation is that if we obtain the sketch of each block independently and combine the resulting sketches (blocking SSketch) as illustrated in Fig. 2, then the combined sketch can be as good as sketching A directly (nonblocking SSketch) in terms of the error-performance trade-off.
This property can be generalized to any number of partitions of A. We leverage this convenient property to increase the performance of our
proposed framework for sketching massive datasets based on a streaming model. In blocking SSketch, the data matrix A is divided into blocks of more manageable size such that there are enough block RAMs on the FPGA to store the corresponding D and a single column of that block. Blocking SSketch achieves significant bandwidth savings, faster loads/stores, less communication traffic between kernels, and a fixed memory requirement on the FPGA. The methodology also provides the capability of factorizing massive, dense datasets in an online streaming model. Independent analysis of each block is especially attractive if the data is distributed across multiple machines. In such settings, each platform can independently compute a local sketch. These sketches can then be combined to obtain the sketch of the original collection. Given a fixed memory budget for the matrix D, as presented in Section VII, blocking SSketch results in a more accurate approximation than nonblocking SSketch. The blocking SSketch computations are done on smaller segments of data, which confers a higher system performance. The higher accuracy comes at the cost of a larger number of non-zeros in V. Note that, as our evaluation results imply, designers can reduce the number of non-zeros in the computed block-sparse matrix by increasing the error threshold ε in the SSketch algorithm.

Fig. 2: Schematic depiction of blocking SSketch.

VI. SSKETCH AUTOMATED HARDWARE IMPLEMENTATION

In this section, we discuss the details of the SSketch hardware implementation. After applying preprocessing steps on the stream of input data for dictionary learning, SSketch sends the data to the FPGA through a 1 Gbps Ethernet port. SSketch is devised with multiple OMP kernels and a control unit to efficiently compute the block-sparse matrix V. As the stream of data arrives, the control unit checks the availability of the OMP kernels and assigns each newly arrived sample to an idle kernel for further processing.
The control unit also has the responsibility of reading out the outputs and sending the results back to the host. The SSketch API provides designers with a user-friendly interface for rapid prototyping of arbitrary matrix-based data analysis algorithms and realizing streaming applications on FPGAs (Fig. 3). Algorithmic parameters of SSketch, including the projection threshold α, error threshold ε, dictionary size l, target block-sparsity level k, and block-size m_b, can be easily changed through the SSketch API. The SSketch API is open source and freely available on our website [24].

Fig. 3: High level diagram of the SSketch API. It takes the stream of input data in matrix form as its input. SSketch divides the input data matrix into more manageable blocks of size m_b and adaptively learns the corresponding sketch of the data. SSketch gets its algorithmic parameters, such as the projection threshold α, error threshold ε, dictionary size l, block-sparsity level k, and block-size m_b, from the user and computes the sketch accordingly. The learned data sketch can then be used for building a domain-specific architecture that scalably performs data analysis on an ensemble of lower dimensional structures rather than the original matrix without a significant loss.

In the OMP hardware implementation, we utilize several techniques to reduce the iteration interval of two successive operations and exploit the parallelism within the algorithm. We observe that the OMP algorithm includes multiple dot product computations, which result in frequent for-loops requiring an operation of the form a += b[i] * c[i]. We use a tree-based reduction module, implemented as a tree-based adder, to accelerate the dot product and norm computation steps that appear frequently in the OMP routine. By means of the reduction module, SSketch is able to reduce the iteration interval and handle more operations simultaneously.
As such, SSketch requires multiple concurrent loads and stores from a particular RAM. To cope with the concurrency, instead of having a large block RAM for matrices D and Q, we use multiple smaller sized block memories and fill these block RAMs by cyclic interleaving. Thus, we can perform a faster computation by accessing multiple successive elements of the matrices and removing the dependency in the for-loops. Using the block RAM is desirable in FPGA implementations because of its fast access time. The number of block RAMs on one FPGA is limited, so it is important to reduce the amount of utilized block memories. We reduce block RAM utilization in our realization by a factor of 2 compared to the naive implementation. This reduction is a consequence of our observation that none of the columns of matrix D would be selected twice during one call of the OMP algorithm. Thus, for computing line 4 of Alg. 2 we only use the indices of D that are not selected during the previous OMP iterations. We instead use the memory space that was originally assigned to the selected columns of D to store the matrix Q. By doing so we reduce the block RAM utilization, which allows SSketch to employ more OMP kernels in parallel. VII. EXPERIMENTAL EVALUATIONS For evaluation purposes, we apply our methodology on three sets of data: (i) Light field, (ii) Hyperspectral images,
and (iii) synthetic data. To ensure that dictionary learning is independent of the data order, for each fixed set of algorithmic parameters we shuffle the data before each call of the SSketch algorithm. The mean error value over repeated calls of the SSketch algorithm and its variance are reported in Figs. 4 and 5. In all cases, we observe that the variance of the error value is two to three orders of magnitude smaller than the mean value, implying that the SSketch algorithm has a low dependency on the data arrival order. This convenient property of the SSketch algorithm is promising for adaptive dictionary learning and sketching purposes in streaming applications.

A. Light Field Experiments

A light field image is a multi-dimensional array of images that are simultaneously captured from slightly different viewpoints. Promising capabilities of light field imaging include the ability to define the field's depth, focus or refocus on a part of the image, and reconstruct a 3D model of the scene [25]. To evaluate the accuracy of the SSketch algorithm, we run our experiments on light field data consisting of 25 samples, each constructed of image patches. The light field data results in a data matrix with 4 million non-zero elements. We choose this moderate input matrix size to accommodate the SVD algorithm for comparison purposes and to enable exact error measurement, especially for correlation matrix (a.k.a. Gram matrix) approximation. The Gram matrix of a data collection consists of the pairwise inner products of the data vectors. The core of several important data analysis algorithms is iterative computation on the data Gram matrix. Examples of Gram matrix usage include, but are not limited to, kernel-based learning and classification methods, as well as several regression and regularized least squares routines [31]. In SSketch, the number of selected columns l for the dictionary has a direct effect on the convergence error rate as well as the speed.
Factorization with a larger l achieves a smaller convergence error but decreases performance due to the increase in computation. Figs. 4a and 4d report the results of applying both nonblocking and blocking SSketch to the data. We define the approximation errors for the input data and its corresponding Gram matrix as Xerr = ‖A − Ã‖_F / ‖A‖_F and Gerr = ‖AᵗA − ÃᵗÃ‖_F / ‖AᵗA‖_F, where Ã = DV. Given a fixed memory budget for the matrix D, Figs. 4a and 4d illustrate that blocking SSketch results in a more accurate sketch than the nonblocking one. The number of rows in each block of input data (block-size) has a direct effect on SSketch's performance. Figs. 4b, 4e, and 4c demonstrate the effect of the block-size on the data and Gram matrix approximation errors as well as the matrix compression-rate, where the compression-rate is defined as (nnz(D) + nnz(V)) / nnz(A). In this setting, the higher accuracy of blocking SSketch comes at the cost of a larger number of non-zeros in the block-sparse matrix V. As illustrated in Figs. 4b, 4e, and 4c, designers can easily reduce the number of non-zeros in the matrix V by increasing the SSketch error threshold ε. In the SSketch algorithm, there is a trade-off between the sketch accuracy and the number of non-zeros in the block-sparse matrix V. The optimal operating point of the SSketch methodology is determined by the error threshold and the hardware constraints. Considering the spectral norm error (e_k = ‖A − Ã‖_2 / ‖A‖_2) instead of the Frobenius norm error, the theoretical minimal error can be expressed as σ_{k+1} = min{‖A − A_k‖_2 : A_k has rank k}, where σ_k is the exact k-th singular value of A and A_k is obtained by SVD [26]. Fig. 4f compares e_k with the theoretical minimal error for the light field dataset.

B. Hyperspectral Experiments

A hyperspectral image is a sequence of images generated by hundreds of detectors that capture information from across the electromagnetic spectrum.
With this collected information, one obtains a spectral signature for each pixel of the image that can be used for identifying or detecting the material [27]. Hyperspectral imaging is a new type of high-dimensional image data and a promising tool for applications in earth-based and planetary exploration, geo-sensing, and beyond. This fast and non-destructive technique provides a large amount of spectral and spatial information on a wide number of samples. However, the large size of hyperspectral datasets limits the applicability of this technique, especially for scenarios where online evaluation of a large number of samples is required. We adapt the SSketch framework to capture the sketch of each pixel for the purpose of enhancing the computational performance and reducing the total storage requirement for hyperspectral images. In this experiment, the SSketch algorithm is applied to two different hyperspectral datasets [28], [29]. Our experiments show that the SSketch algorithm results in the same trend as in Figs. 4a and 4d for both hyperspectral datasets. As Fig. 5 illustrates, the Gram matrix factorization error drops to the order of 10⁻⁶ as l grows in both datasets.

Fig. 5: Factorization error (Gerr) vs. l for two different hyperspectral datasets. The SSketch algorithmic parameters are set to α = 0.1 and ε = 0.1, where α is the projection threshold and ε is the error threshold.

C. Scalability

We provide a comparison between the complexity of different implementations of the OMP routine in Fig. 6. Batch OMP (BOMP) is a variation of OMP that is especially optimized for sparse-coding of large sets of samples over the same dictionary. BOMP requires more memory than conventional OMP, since it needs to store DᵗD along with the matrix D. BOMPQR and OMPQR both have near-linear computational complexity.
We use the OMPQR method in our target architecture as it is more memory-efficient.
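The QR-based OMP recursion can be illustrated in a few lines of NumPy. The sketch below is ours, not the paper's FPGA kernel: it shows the two steps each iteration performs, greedy atom selection followed by a least-squares update that solves R c = Q^t x from a QR factorization of the selected atoms instead of inverting the normal equations, which is the idea behind the OMPQR variant compared in Fig. 6.

```python
import numpy as np

def omp_qr(D, x, k, eps=1e-6):
    """Illustrative OMP over dictionary D (m x l) for one sample x.

    The least-squares step uses a QR factorization of the selected
    atoms, avoiding the explicit (D_S^t D_S)^{-1} of plain OMP.
    """
    m, l = D.shape
    residual = x.copy()
    support = []
    c = np.zeros(0)
    coeffs = np.zeros(l)
    for _ in range(k):
        # Greedy step: atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Economy QR of the selected atoms; solving the triangular
        # system R c = Q^t x is cheaper and better conditioned than
        # forming the normal equations.
        Q, R = np.linalg.qr(D[:, support])
        c = np.linalg.solve(R, Q.T @ x)
        residual = x - D[:, support] @ c
        if np.linalg.norm(residual) <= eps * np.linalg.norm(x):
            break
    coeffs[support] = c
    return coeffs
```

For a dictionary with orthonormal columns, the routine recovers an exactly k-sparse sample in k iterations; a hardware implementation would instead update the QR factors incrementally per added atom rather than refactorizing.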
Fig. 4: Experimental evaluations of SSketch. α and ɛ denote the user-defined projection threshold and the error threshold, respectively; l is the number of samples in the dictionary matrix and m_b is the block-size. (a) Factorization error (Xerr) vs. l for blocking and nonblocking SSketch, with α = 0.1, block-size = 2 and ɛ = 0.1. (b) Factorization error (Xerr) vs. block-size (m_b) with α = 0.1 and l = 128. (c) Compression-rate vs. block-size with α = 0.1 and l = 128. (d) Factorization error (Gerr) vs. l with α = 0.1, block-size = 2 and ɛ = 0.1. (e) Factorization error (Gerr) vs. block-size with α = 0.1 and l = 128. (f) Spectral factorization error (e_k) vs. l for SSketch compared to the minimal possible error (SVD).

Fig. 6: Computational complexity comparison of different OMP implementations (OMP, BOMP, BOMPQR, OMPQR): relative computation time vs. matrix size m. Using QR decomposition significantly improves OMP's runtime.

The complexity of our OMP algorithm is linear in both m and n, so dividing A_{m×n} into several blocks along the dimension m and processing each block independently does not add to the total computational complexity of the algorithm. However, it shrinks the data size to fit into the FPGA block RAMs and improves the performance. Let T_OMP(m, l, k) denote the number of operations required to obtain the sketch of a vector of length m with target sparsity level k. The runtime of the system is then a linear function of T_OMP, which makes the proposed architecture scalable for factorizing large matrices. The complexity of the projection step in the SSketch algorithm (line 4 of Alg. 1) is O(l^3 + 2lm + 2l^2m).
However, if we decompose D_{m×l} into Q_{m×m} R_{m×l} and replace DD^+ with QĨQ^t, the computational complexity of the projection step is reduced to O(2lm + l^2m). Assuming D = QR, the projection matrix can be written as:

D(D^tD)^{-1}D^t = QR(R^tQ^tQR)^{-1}R^tQ^t = Q(RR^{-1})(R^{-t}R^t)Q^t = QĨQ^t, (2)

which we use to decrease the projection step's complexity. Table I compares different factorization methods with respect to their complexity. SSketch's complexity indicates a linear relationship with both n and m. In SSketch, the computations can be parallelized, as the sparse representation can be computed independently for each column of the sub-blocks of A.

TABLE I: Complexity of different factorization methods.
Algorithm | Complexity
SVD | m^2n + m^3
SPCA | lmn + m^2n + m^3
SSketch (this paper) | n(lm + l^2m) + mnl^2 ≈ 2mnl^2

D. Hardware Settings and Results

We use the Xilinx Virtex-6 XC6VLX240T FPGA ML605 Evaluation Kit as our hardware platform. An Intel Core i7-2600K processor with SSE4 architecture running the Windows OS is utilized as the general-purpose processing unit hosting the FPGA. In this work, we employ Xilinx standard IP cores for single-precision floating-point operations, and we use Xilinx ISE 14.6 to synthesize, place, route, and program the FPGA. Table II shows the Virtex-6 resource utilization of our heterogeneous architecture. SSketch includes four OMP cores plus an Ethernet interface. For factorizing a matrix A_{m×n}, there is no specific limitation on the size n due to the streaming nature of SSketch. However, the FPGA block RAM size is limited. To fit into the RAM, we divide the input matrix A into blocks of size m_b × n, where m_b and k are set to be less than 256. Note that these parameters are adjustable in the SSketch API, so a designer who requires a larger m_b or k can easily modify them according to the application.
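The host-side blocking scheme described above can be sketched in NumPy. This is an illustrative sketch only: the function name is ours, and the per-block dictionary and coefficient computation are stand-ins (first-l-columns dictionary, dense least squares) for SSketch's adaptive dictionary learning and OMP kernels. What it shows is the mechanics of splitting A along its row dimension into m_b-row blocks and sketching each block independently as A_block ≈ DV.

```python
import numpy as np

def sketch_blocks(A, l=4, m_b=8):
    """Illustrative blocking driver: sketch each m_b-row block of A
    independently as D V (placeholders for SSketch's learned dictionary
    and block-sparse coefficients)."""
    m, n = A.shape
    approx = np.empty_like(A, dtype=float)
    for start in range(0, m, m_b):
        block = A[start:start + m_b, :]
        D = block[:, :l]                       # placeholder dictionary
        V, *_ = np.linalg.lstsq(D, block, rcond=None)
        approx[start:start + m_b, :] = D @ V   # block sketch D V
    return approx
```

Because each block is processed on its own, blocks can be streamed through the accelerator one at a time, and the data approximation error of the result can be measured with the Xerr metric defined earlier, e.g. `np.linalg.norm(A - sketch_blocks(A)) / np.linalg.norm(A)`.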
To corroborate the scalability and practicability of our framework, we use synthetic data with dense (non-sparse) correlations of different sizes as well as a hyperspectral image dataset [29].

TABLE II: Virtex-6 resource utilization (Used / Available / Utilization) for Slice Registers, Slice LUTs, RAMB36E1 blocks, and DSP48E1 slices.

The runtime of SSketch for the different-sized synthetic datasets is reported in Table III, where the total delay of SSketch (T_SSketch) can be expressed as:

T_SSketch = T_dictionary learning + T_communication overhead + T_FPGA computation. (3)

As shown in Table III, the total delay of SSketch is a linear function of the number of processed samples, which experimentally confirms the scalability of our proposed architecture. According to Table III, the whole system, including the dictionary learning process, takes 21.29s to process 5K samples of 256 elements each. The samples pass through the OMP kernels, and each requires 89 iterations on average to complete the process. In this experiment, SSketch achieves an average throughput of 169 Mbps. For both the hyperspectral dataset [29] and the synthetic dense data, our HW/SW co-design approach (with target sparsity level k = 128) achieves up to 200-fold speedup compared to a software-only realization on a 3.4 GHz CPU. The average overhead delay for communication between the processor (host) and the accelerator contributes less than 4% to the total delay.

TABLE III: SSketch total processing time is linear in the number of processed samples.
Size of n | T_SSketch (l = 128) | T_SSketch (l = 64)
n = 1K | 3.635 sec | 2.31 sec
n = 5K | 21.29 sec | 12.1 sec
n = 10K | - | 24.32 sec
n = 20K | 9.761 sec | 48.52 sec

VIII. CONCLUSION

This paper presents SSketch, an automated computing framework for FPGA-based online analysis of big and densely correlated data matrices. SSketch utilizes a novel streaming and communication-minimizing methodology to efficiently capture the sketch of the collection. It adaptively learns and leverages the hybrid structure of the streaming input data to effectively improve the performance.
To boost the computational efficiency, SSketch is devised with a novel scalable implementation of OMP on FPGA. Our framework provides designers with a user-friendly API for rapid prototyping and evaluation of arbitrary matrix-based big data analysis algorithms. We evaluate the method on three different large contemporary datasets. In particular, we compare SSketch's runtime to a software realization on a CPU and also report the delay overhead for communication between the processor (host) and the accelerator. Our evaluations corroborate SSketch's scalability and practicability.

ACKNOWLEDGMENTS

This work was supported in part by the Office of Naval Research grant (ONR-N).

REFERENCES

[1] E. Liberty. Simple and deterministic matrix sketching. Proceedings of the 19th ACM SIGKDD, 2013.
[2] A. Sergyienko and O. Maslennikov. Implementation of Givens QR-decomposition in FPGA. Parallel Processing and Applied Mathematics, 2002.
[3] W. Zhang, V. Betz, and J. Rose. Portable and scalable FPGA-based acceleration of a direct linear system solver. ICECE Technology, 2008.
[4] P. Greisen, M. Runo, P. Guillet, S. Heinzle, A. Smolic, H. Kaeslin, and M. Gross. Evaluation and FPGA implementation of sparse linear solvers for video processing applications. IEEE TCSVT, 2013.
[5] L. M. Ledesma-Carrillo, E. Cabal-Yepez, R. de J. Romero-Troncoso, et al. Reconfigurable FPGA-based unit for Singular Value Decomposition of large m x n matrices. ReConFig, 2011.
[6] S. Rajasekaran and M. Song. A novel scheme for the parallel computation of SVDs. HPCC, Springer Berlin Heidelberg, 2006.
[7] A. Mirhoseini, E. L. Dyer, E. M. Songhori, R. Baraniuk, and F. Koushanfar. RankMap: A Platform-Aware Framework for Distributed Learning from Dense Datasets. arXiv preprint, 2015.
[8] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to linear regression analysis. John Wiley and Sons, 2012.
[9] M. Andrecut. Fast GPU implementation of sparse signal recovery from random projections. arXiv preprint, 2008.
[10] P. Maechler, P. Greisen, N. Felber, and A. Burg. Matching pursuit: Evaluation and implementation for LTE channel estimation. ISCAS, 2010.
[11] L. Bai, P. Maechler, M. Muehlberghuber, and H. Kaeslin. High-speed compressed sensing reconstruction on FPGA using OMP and AMP. ICECS, 2012.
[12] A. M. Kulkarni, H. Homayoun, and T. Mohsenin. A parallel and reconfigurable architecture for efficient OMP compressive sensing reconstruction. Proceedings of the 24th Great Lakes Symposium on VLSI, ACM, 2014.
[13] Frontiers in Massive Data Analysis. The National Academies Press, 2013.
[14] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. Proceedings of the 41st ACM Symposium on Theory of Computing, 2009.
[15] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 1970.
[16] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15.2, 2006.
[17] D. S. Papailiopoulos, A. G. Dimakis, and S. Korokythakis. Sparse PCA through Low-rank Approximations. arXiv preprint, 2013.
[18] E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk. Greedy feature selection for subspace clustering. JMLR 14.1, 2013.
[19] J. D. Blanchard and J. Tanner. GPU accelerated greedy algorithms for compressed sensing. Mathematical Programming Computation, 2013.
[20] Y. Fang, L. Chen, J. Wu, and B. Huang. GPU implementation of orthogonal matching pursuit for compressive sensing. ICPADS, 2011.
[21] A. Septimus and R. Steinberg. Compressive sampling hardware reconstruction. ISCAS, 2010.
[22] J. L. Stanislaus and T. Mohsenin. Low-complexity FPGA implementation of compressive sensing reconstruction. ICNC, 2013.
[23] J. L. Stanislaus and T. Mohsenin. High performance compressive sensing reconstruction hardware with QRD process. ISCAS, 2012.
[24]
[25] K. Marwah, G. Wetzstein, Y. Bando, and R. Raskar. Compressive light field photography using overcomplete dictionaries and optimized projections. ACM TOG, 2013.
[26] G. Martinsson, A. Gillman, E. Liberty, et al. Randomized methods for computing the Singular Value Decomposition of very large matrices. Workshop on Algorithms for Modern Massive Datasets, 2010.
[27] A. Plaza, J. Plaza, A. Paz, and S. Sanchez. Parallel hyperspectral image and signal processing. IEEE Signal Processing Magazine, 2011.
[28]
[29] Remote Sensing Scenes.
[30] P. Drineas and M. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR 6, 2005.
[31] R. Tibshirani. Regression shrinkage and selection via the lasso. JSTOR, 1996.
Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip Ms Lavanya Thunuguntla 1, Saritha Sapa 2 1 Associate Professor, Department of ECE, HITAM, Telangana
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
Hardware Task Scheduling and Placement in Operating Systems for Dynamically Reconfigurable SoC
Hardware Task Scheduling and Placement in Operating Systems for Dynamically Reconfigurable SoC Yuan-Hsiu Chen and Pao-Ann Hsiung National Chung Cheng University, Chiayi, Taiwan 621, ROC. [email protected]
Operation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions
Module 4: Beyond Static Scalar Fields Dynamic Volume Computation and Visualization on the GPU Visualization and Computer Graphics Group University of California, Davis Overview Motivation and applications
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
How To Make Visual Analytics With Big Data Visual
Big-Data Visualization Customizing Computational Methods for Visual Analytics with Big Data Jaegul Choo and Haesun Park Georgia Tech O wing to the complexities and obscurities in large-scale datasets (
