Massive Data Analysis Using Fast I/O


Da Zheng
Department of Computer Science
The Johns Hopkins University
Baltimore, Maryland

Abstract

The advance of new I/O technologies has tremendously improved random I/O performance. This raises the question of how much the improvement can benefit large-scale data analysis and, more specifically, whether it can help achieve performance comparable to RAM-based counterparts. We developed SAFS, a userspace filesystem optimized for large SSD arrays, and FlashGraph, a semi-external memory graph analysis framework, to address this question, specifically targeting two important data analysis tasks: graph analysis and linear algebra. Our preliminary results show that FlashGraph in the external-memory mode achieves performance comparable to its in-memory mode, and that both modes of FlashGraph significantly outperform PowerGraph, a popular in-memory graph engine.

Introduction

In today's big data era, we face challenges in both the explosion of data volume and the increasing complexity of data analysis. Experiments, simulations and observations generate data on the order of terabytes or even petabytes in many scientific and business areas. In many cases, data at such a volume is irregular. For example, a graph that represents a real-world problem has nearly random connections between vertices; some captured datasets have missing data due to imperfect observation. Analyzing irregular data at a large scale poses significant challenges for conventional tools such as SQL databases. These challenges have led to the redesign of data analysis tools. MapReduce [21] is one of the most well-known tools developed for data analysis at the petabyte scale. However, even MapReduce cannot handle many data analysis tasks efficiently. Graph analysis is one of the well-known examples that neither SQL databases nor MapReduce can perform efficiently.
Nowadays, graph analysis is generally performed in RAM, and a large graph has to be processed in a cluster of machines. Many real-world graphs are enormous. For example, the Facebook social network graph has billions of vertices and hundreds of billions of edges; the neural network of a human brain has a fundamentally larger number of vertices and edges; graphs that evolve over time can grow even larger. Vertices in real-world graphs connect with each other nearly randomly, and the vertex degree usually follows a power-law distribution. Graph algorithms typically generate many small random accesses and suffer from a lack of data locality and from load imbalance. Therefore, it is difficult to scale graph analysis to many machines. On the other hand, with the advance of new I/O technologies, we observe tremendous improvement in random I/O performance. One example is flash-based memory. The fastest Fusion-io devices and a large SSD array [1] can reach over one million random IOPS and multiple gigabytes per second. This is on the same order as, or only one order of magnitude less than, RAM. Today's network technology has also advanced to throughput similar to disk I/O. The tremendous performance improvement in random I/O provides a good opportunity to enhance large-scale data analysis that requires many small random accesses.

Given the tremendous hardware improvement, typical questions are: (i) can we replace RAM with fast flash in large-scale data analysis? (ii) To what extent can the performance of flash-based data analysis approach that of RAM-based data analysis? If data analysis on flash can achieve performance similar to that in RAM, it will positively affect computer architecture and revolutionize large-scale data analysis.

Hardware advances impose many new challenges in system design (both the operating system and the data analysis framework). Operating systems were traditionally built with an assumption of slow I/O. There exists significant overhead in almost all layers of the block stack when it operates on fast storage media. For example, a traditional Linux filesystem on top of a large SSD array can only yield a small fraction of the capacity of the array. I/O latency still remains relatively high, usually multiple orders of magnitude larger than that of RAM. High-speed I/O consumes significant CPU power and main memory bandwidth. To maximize overall performance, it is crucial to minimize the CPU overhead and memory bandwidth use from I/O. Although the latest I/O devices deliver unprecedented performance, they are still slower than RAM, let alone CPU cache. There are many opportunities to optimize the system to achieve performance better than the raw hardware can deliver. Given all of these challenges, we try to answer the questions with SAFS [1], a userspace filesystem optimized for large SSD arrays, and FlashGraph [2], a semi-external memory graph analysis framework.
SAFS abstracts away the complexity of data access on SSDs and provides an efficient caching layer. To achieve performance comparable to in-memory counterparts, FlashGraph takes advantage of SAFS and issues I/O requests carefully to bring data from SSDs to CPUs efficiently. We follow the six rules below to design and implement SAFS and FlashGraph.

Avoid remote memory access: modern multiprocessor systems have non-uniform memory architectures (NUMA), in which regions of memory are associated with processors. Accessing remote memory (that of another processor) has higher latency and lower bandwidth, and causes overhead and contention on the remote processor.

Avoid locking: the main overhead in Linux filesystems comes from locks in the page cache, filesystems and device drivers. The locking overhead is amplified in non-uniform memory architectures.

Reduce random access: although SSDs have good random I/O performance, their sequential I/O performance is still significantly better. Furthermore, sequentializing I/O access helps reduce both the number of I/O accesses and the CPU overhead of I/O.
Reduce data access: SSDs are still several times slower than RAM and they require data access in blocks. Any unused data read from SSDs wastes SSD bandwidth and pollutes the memory cache. It is essential to organize the data on SSDs well to minimize the amount of unused data read from them.

Maximize cache hit rates: SSDs are still much slower than RAM. Any cache hit improves the application-perceived I/O performance and boosts overall performance.

Overlap I/O and computation: SSDs have latency multiple orders of magnitude longer than RAM. To maximize the I/O performance of SSDs, we need to issue many parallel I/O requests and overlap I/O with computation so that either the CPUs or the SSDs are saturated.

We specifically target two important types of data analysis on fast I/O: graph analysis and linear algebra, which the National Research Council identified as two of the major massive data analysis tasks [3]. There are various graph forms for analysis. FlashGraph supports graph analysis on a single, static graph as well as on a temporal graph that evolves over time but is collected and prepared in advance. It will also support streaming graph analysis, where changes in a graph are constantly generated and fed to the analysis framework. Furthermore, given that graphs and sparse matrices use the same data representation, we will implement a sparse matrix library on top of FlashGraph to process massive sparse matrices.

The architecture of FlashGraph (Figure 1) has three layers: (i) SAFS, (ii) the graph engine, and (iii) graph applications. SAFS hides the complexity of accessing a large SSD array and exposes an asynchronous I/O interface to help overlap I/O and computation. The graph engine exposes a general vertex-centric programming interface to help users express graph algorithms as vertex programs. It schedules users' vertex programs and issues I/O requests carefully on their behalf to maximize the overall performance of graph applications.
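The last of the six rules above, overlapping I/O and computation, can be illustrated with a small sketch. This is not the SAFS implementation; `read_block` is a hypothetical stand-in for a slow device read, and a thread pool plays the role of many outstanding asynchronous requests, so results are processed as they complete rather than one read at a time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_block(block_id):
    # Stand-in for a slow device read (hypothetical; a real system
    # would issue asynchronous I/O requests to an SSD array).
    return bytes([block_id % 256]) * 4096

def process(data):
    # CPU work that runs while other reads are still outstanding.
    return sum(data)

def run(block_ids, queue_depth=32):
    # Keep many requests in flight so that either the CPU or the
    # "device" stays saturated.
    results = {}
    with ThreadPoolExecutor(max_workers=queue_depth) as pool:
        futures = {pool.submit(read_block, b): b for b in block_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = process(fut.result())
    return results

out = run(range(8))
```

With a synchronous interface, each read would stall the computation; here the queue depth, not the per-request latency, bounds throughput.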
Furthermore, it provides a library of common graph algorithms and matrix operations, as well as an interface for temporal graph analysis.

Figure 1. The architecture of FlashGraph: graph applications (the graph-algorithm library, the sparse matrix library, temporal graph analysis and user graph algorithms) run as vertex programs on the vertex-centric interface of the graph engine (vertex scheduler, key-value store), which issues vertex tasks through the asynchronous user-task I/O interface of SAFS (page cache) to an array of SSDs.
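The layering in Figure 1 can be sketched as follows. All class and method names here are illustrative, not the real SAFS/FlashGraph API: the bottom layer exposes an asynchronous interface that runs a user task when a request completes, and the middle layer issues I/O on behalf of user vertex programs.

```python
class SAFS:
    """Bottom layer: asynchronous user-task I/O (illustrative)."""
    def __init__(self, storage):
        self.storage = storage      # dict: vertex id -> adjacency list
        self.pending = []
    def async_read(self, vid, on_complete):
        self.pending.append((vid, on_complete))
    def poll(self):
        # Complete outstanding requests, running each user task
        # "inside the filesystem" on the data it requested.
        for vid, task in self.pending:
            task(vid, self.storage[vid])
        self.pending = []

class GraphEngine:
    """Middle layer: schedules vertex programs and issues I/O for them."""
    def __init__(self, fs):
        self.fs = fs
    def run(self, vids, vertex_program):
        for vid in vids:
            self.fs.async_read(vid, vertex_program)
        self.fs.poll()

degrees = {}
def degree_program(vid, adj):       # top layer: a user vertex program
    degrees[vid] = len(adj)

fs = SAFS({0: [1, 2], 1: [0], 2: []})
GraphEngine(fs).run([0, 1, 2], degree_program)
```

The point of the layering is that the vertex program never touches the storage directly; it only reacts to completed requests handed up by the lower layers.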
SAFS

SAFS [1] is a userspace filesystem optimized for large SSD arrays. It is designed as a building block for external-memory data analysis. The goals of the filesystem are to: (i) maximize the throughput of fast SSDs; (ii) reduce memory overhead in the page cache; (iii) provide a general programming interface that supports a large range of data analysis.

Maximizing the performance of an SSD array

A large SSD array is fast enough to consume much CPU computation, and thus requires multiple CPU cores or multiple processors to provide enough computational power to saturate its bandwidth while performing useful computation. However, traditional operating systems such as Linux were not designed for fast disk I/O on multicore or multiprocessor machines. There are many locks in each component of the I/O stack, such as the disk drivers, filesystems and page cache. Therefore, a traditional Linux filesystem on top of a large SSD array can only yield a small fraction of its capacity. SAFS is designed to eliminate overhead in the I/O stack and deliver the maximal performance of a large SSD array in a NUMA machine.

Per-SSD I/O threads

To eliminate lock overhead in the I/O stack of the Linux kernel, SAFS uses one I/O thread per SSD to access data. We expose the individual SSDs connected to a machine to the operating system, so that we can run a dedicated I/O thread for each SSD. This I/O thread is the only thread in the system that accesses the SSD, which eliminates all lock contention in the disk drivers as well as in the Linux native filesystem. The I/O threads communicate with application threads through message passing to receive I/O requests and send back results.

Set-associative cache

Although SSDs have advanced to deliver very high I/O throughput, a page cache remains an important optimization for overall system performance. SAFS is equipped with a very lightweight parallel page cache [4] to amplify I/O performance when cache hits exist in the workload.
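The per-SSD I/O thread design described above can be sketched in a few lines. This is a toy model, not the SAFS code: each "device" is just a Python list, one dedicated thread owns it, and requests and replies travel over queues, so no lock ever guards the device itself.

```python
import threading, queue

def io_thread(device, requests):
    # The only thread that ever touches `device`, so the device needs
    # no locking; requests arrive purely by message passing.
    while True:
        req = requests.get()
        if req is None:             # shutdown message
            break
        offset, reply = req
        reply.put(device[offset])   # simulated block read

# One request queue and one dedicated thread per "SSD".
devices = [list(range(d * 100, d * 100 + 100)) for d in range(4)]
queues = [queue.Queue() for _ in devices]
threads = [threading.Thread(target=io_thread, args=(d, q))
           for d, q in zip(devices, queues)]
for t in threads:
    t.start()

def read(dev_id, offset):
    reply = queue.Queue()
    queues[dev_id].put((offset, reply))   # send the request as a message
    return reply.get()                    # receive the result as a message

value = read(2, 5)                        # block 5 on device 2
for q in queues:
    q.put(None)                           # ask the I/O threads to exit
for t in threads:
    t.join()
```

Contention is confined to the queues, which are short critical sections, instead of being spread through a shared I/O stack.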
The new page cache shares the same design as a CPU cache. The main idea is to split the page cache into many small page sets, each managed independently, to reduce lock and CPU overhead.

Data locality in NUMA

Given the non-uniform memory performance of a NUMA machine, we attempt to localize data access in the page cache to improve memory performance. We partition the page cache and keep a partition in each NUMA node. Thanks to the asynchronous user-task I/O interface, we can send an I/O request together with a user task, through message passing, to the partition where the data resides.

Asynchronous user-task I/O interface

On high-throughput, high-latency I/O devices, the traditional synchronous I/O interface can no longer saturate the hardware with small random I/O access. Applications need to issue many I/O requests in order to maximize the performance of SSDs. We designed the asynchronous user-task I/O interface, in which each I/O request issued to SAFS is associated with user-defined computation. Upon the completion of a request, the user computation is performed inside the filesystem. This asynchronous interface allows applications to issue many parallel I/O requests and to overlap I/O and computation.

Besides overlapping computation and I/O, the asynchronous user-task I/O interface has many other benefits. It enables user computation to access data in the page cache directly. Therefore, it reduces memory copies, avoids memory allocation and reduces the memory footprint. Given this I/O interface, we can even bring user computation to the location where the data resides, to exploit data locality in a NUMA machine.

FlashGraph

FlashGraph [2] is a semi-external memory graph engine that runs on top of SAFS. It provides a general programming interface to help users develop new algorithms for large-scale graph analysis. Like many in-memory graph engines such as Pregel [5] and PowerGraph [6], graph algorithms in FlashGraph are expressed from the perspective of a vertex. That is, each vertex maintains algorithmic state, and FlashGraph executes graph algorithms on the vertex state in parallel. Unlike these in-memory graph engines, FlashGraph keeps only the algorithmic vertex state in memory and the graph data on SSDs, so it has a small memory footprint, which enables us to process massive graphs on a commodity machine with relatively little memory.

External-memory data representation

FlashGraph maintains only vertex state in main memory to minimize memory consumption, while storing graph data such as the degrees and adjacency lists of vertices on SSDs. Since SSDs are still several times slower than RAM, the external-memory data representation in FlashGraph has to meet three goals: (i) be compact, to reduce data access from SSDs; (ii) reduce the unnecessary data read from SSDs; (iii) reduce the number of I/O accesses for each vertex read. There are two data structures on SSDs: the adjacency lists of the vertices, and the index to the adjacency lists.
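A simplified version of this on-disk layout can be sketched as follows. The details are illustrative (an undirected graph, 4-byte neighbor IDs, an in-memory byte string standing in for the SSD file): adjacency lists are packed contiguously in vertex-ID order, and the index stores only each list's byte offset, so reading a vertex costs one index lookup plus one contiguous read.

```python
import struct

def build_graph_file(adj_lists):
    """Pack adjacency lists in vertex-ID order; the index stores only
    each list's byte offset (its length is the gap to the next offset)."""
    data, index, off = bytearray(), [], 0
    for vid in sorted(adj_lists):
        index.append(off)
        for nbr in adj_lists[vid]:
            data += struct.pack("<I", nbr)   # 4-byte neighbor IDs
            off += 4
    index.append(off)                        # sentinel: end of data
    return bytes(data), index

def read_adj(data, index, vid):
    """One index lookup + one contiguous read per vertex."""
    start, end = index[vid], index[vid + 1]
    return [struct.unpack_from("<I", data, o)[0]
            for o in range(start, end, 4)]

data, index = build_graph_file({0: [1, 2], 1: [0], 2: [0, 1, 3], 3: [2]})
```

Because lists are packed without per-vertex headers, no unused bytes are fetched, and because the index carries only offsets, it stays small enough to read cheaply.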
For a directed graph, the in-edge list and the out-edge list of a vertex are stored separately to optimize for applications that only require one type of edge. All adjacency lists are stored on SSDs in the order of vertex ID. Therefore, the index of a graph only contains the location of the adjacency list of each vertex. For a directed graph, each entry in the index contains the locations of both the in-edge list and the out-edge list of the vertex.

I/O optimizations

Given the good random I/O performance of SSDs, FlashGraph selectively accesses the vertices required by graph algorithms. In contrast, other external-memory graph engines such as GraphChi [7] and X-Stream [8] sequentially access all vertices in each iteration to avoid random I/O access. Selective vertex access is superior to sequentially accessing the entire graph in each iteration: many graph algorithms only require some of the vertices in a graph, so selective access significantly reduces the amount of data read from SSDs. However, it potentially generates many random I/O accesses to SSDs. To maximize the performance of FlashGraph, it is essential to reduce the number of random I/O accesses. Each vertex access potentially generates two I/O accesses: one to the index file, to find the location and size of a vertex, and the other to the adjacency lists, to read the adjacency list of the vertex. The optimizations for both types of I/O access are the same. A typical graph algorithm accesses many vertices in parallel within an iteration. Therefore, FlashGraph can collect the vertices requested by a graph algorithm and merge their I/O accesses into large I/O requests. As a result, an I/O request varies from as small as one page to as large as many megabytes, to benefit various graph algorithms.

To increase the cache hit rate, FlashGraph introduces a runtime 2D partitioning scheme. 2D partitioning breaks up large vertices, so a vertex program may represent only part of a large vertex. In each iteration, FlashGraph executes all active vertices in the first partition, then proceeds to the second partition, and so on. Users have complete freedom to issue vertex requests in each partition of a vertex. A potential use is that a partition of a vertex requests vertices located close together on SSDs, to increase the cache hit rate. Triangle counting and scan statistics [9] are two examples that benefit from this scheme.

Vertex-centric programming interface

In FlashGraph, users implement graph algorithms from the perspective of a vertex. Each vertex maintains some vertex state and acts on its own. It interacts with other vertices through message passing and accesses vertex data through the framework. In this programming environment, users only need to write serial code, and FlashGraph executes the users' code in parallel. Given the relatively long latency of accessing SSDs, FlashGraph uses an event-driven programming interface to hide the latency of SSDs and overlap graph computation with I/O access. A vertex may receive three types of events: (i) when it receives the adjacency list of a vertex; (ii) when it receives the vertex header of a vertex (which contains basic information such as the vertex degree); (iii) when it receives a message.
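To make the event-driven style concrete, here is a breadth-first search written as reactions to two of the events above. This is a toy sketch, not the FlashGraph API: a serial scheduler plays the role of the engine, and the adjacency "reads" come from an in-memory dict, whereas FlashGraph would fetch them asynchronously from SSDs and run many vertex programs in parallel.

```python
from collections import deque

def bfs(adj, source):
    """BFS as an event-driven vertex program: a vertex reacts when its
    adjacency list arrives and when it receives a message from a
    neighbor. A toy serial scheduler drives the event loop."""
    level = {source: 0}
    active = deque([source])
    while active:
        vid = active.popleft()
        # Event (i): the adjacency list of `vid` arrives.
        for nbr in adj[vid]:
            # Event (iii): `nbr` receives a message from `vid`.
            if nbr not in level:
                level[nbr] = level[vid] + 1
                active.append(nbr)
    return level

levels = bfs({0: [1, 2], 1: [3], 2: [3], 3: []}, 0)
```

The user code is serial and per-vertex; parallelism, scheduling and I/O merging are left to the framework.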
The second event is a special case of the first one, except that retrieving a vertex header only requires reading the index of the graph. Therefore, it significantly improves the performance of graph algorithms that require only the basic information of a vertex.

Applications

FlashGraph is a comprehensive programming framework for users to develop various graph algorithms and perform graph analysis. In addition, it allows us to develop efficient sparse matrix operations. We also provide a collection of common graph algorithms and sparse matrix operations highly optimized for FlashGraph.

The graph algorithm library

To demonstrate the expressiveness of FlashGraph and ease graph analysis with it, we take an approach similar to igraph [10] and implement some common graph algorithms. The library currently includes PageRank, SSSP, triangle counting, scan statistics [9], weakly connected components, strongly connected components [11], diameter estimation, k-core and spectral clustering. Since FlashGraph targets massive graphs, all graph algorithms implemented in FlashGraph have to have low computational complexity.

Temporal graph analysis

Real-world graphs may evolve over time. In this setting, we need to perform graph analysis on a sequence of graphs. Currently, FlashGraph supports temporal graph analysis when the temporal graph is constructed in advance. A temporal graph in FlashGraph is stored as adjacency lists, with each edge associated with a timestamp. The edges of a vertex are sorted by timestamp to accelerate searching the edge list for a specific timestamp. We have implemented scan statistics on temporal graphs of millions of vertices and billions of edges.

Sparse matrix library

Linear algebra is an important tool for many real-world applications. The data in these applications tends to be sparse due to missing data or incomplete observations, so it is often represented as sparse matrices. A sparse matrix has the same data representation as a graph; therefore, we represent a sparse matrix as a graph in FlashGraph. We implement sparse matrix-vector multiplication, a basic operation for many linear algebra algorithms, and furthermore the implicitly restarted Lanczos method [12] to compute the top-k eigenvalues and eigenvectors of a matrix with billions of non-zero entries.

Experimental evaluation

We evaluate the performance of SAFS and FlashGraph on a non-uniform memory architecture machine with four Intel Xeon E processors clocked at 2.2 GHz and 512 GB of DDR memory. The machine has three LSI SAS host bus adapters (HBAs) connected to a SuperMicro storage chassis. The experiments on SAFS use 16 OCZ Vertex 4 SSDs and the experiments on FlashGraph use 12 OCZ Vertex 4 SSDs.

Performance of SAFS

Figure 2 shows that SAFS read significantly outperforms Linux buffered read, with and without a page cache, on the NUMA machine. In this experiment, we run 32 threads and each thread issues random reads of 128 bytes. Without the page cache, SAFS achieves 950K IOPS from the 16 SSDs, the maximal performance we can achieve from the hardware.
With the page cache but without cache hits, SAFS still achieves the same I/O performance, which means that the page cache introduces almost no overhead. This matters because many graph algorithms cannot generate many cache hits. When the workload does generate cache hits, the user-perceived I/O performance increases with the cache hit rate. Linux buffered read, on the other hand, realizes only a small fraction of the raw hardware I/O performance; it reaches the throughput of the raw devices only when the cache hit rate reaches 75%.

Performance of FlashGraph

FlashGraph in the external-memory mode has performance comparable to the in-memory mode, and both modes of FlashGraph are multiple times faster than PowerGraph for most graph applications on two real-world graphs (as shown in Figure 3). The Twitter graph [13] has 42 million vertices and 1.5 billion edges. The subdomain graph [14] has 89 million vertices and 2 billion edges. The runtime of the graph applications varies significantly, so we normalize all runtimes by that of in-memory FlashGraph. FlashGraph in the external-memory mode achieves at least half the performance of its in-memory mode in most applications. In applications such as diameter estimation and weakly connected components, FlashGraph in the external-memory mode realizes 80% of the performance of its in-memory mode. In some other applications, such as breadth-first search and PageRank, FlashGraph in the external-memory mode is bottlenecked by the SSDs, so we expect better performance when FlashGraph is equipped with more SSDs. Both modes of FlashGraph significantly outperform PowerGraph. In scan statistics, FlashGraph is 50 times faster than PowerGraph. In most cases, FlashGraph outperforms PowerGraph by an order of magnitude.

Figure 2. The performance of SAFS read, with and without cache, compared with Linux buffered read: user-perceived IOPS (x1000) versus cache hit ratio (%).

Figure 3. The runtime of external-memory FlashGraph in different graph applications compared with PowerGraph, on the Twitter and subdomain graphs. All runtimes are normalized by in-memory FlashGraph.

Future work

The current implementation of SAFS and FlashGraph runs on a single machine, and FlashGraph only supports processing static graphs. The next step is to run SAFS and FlashGraph in distributed memory and to support streaming graph analysis. Furthermore, we have demonstrated the potential of FlashGraph for performing sparse matrix operations at a large scale, so we will extend this work and implement an efficient sparse matrix/tensor library on top of SAFS and FlashGraph.
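The core operation such a library builds on, sparse matrix dense-vector multiplication, can be sketched over a CSR-like layout, in which rows are read sequentially, just as the on-disk adjacency lists would be. The function and variable names here are illustrative, not the planned library API.

```python
def spmv(indptr, indices, values, x):
    """y = A @ x with A in a CSR-like layout: indptr[row] marks where
    the row's non-zeros begin, so rows are scanned strictly in order,
    mirroring sequential reads of on-disk adjacency lists."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += values[k] * x[indices[k]]
        y[row] = s
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv(indptr, indices, values, [1.0, 1.0, 1.0])
```

Only the dense vector `x` needs random access, which is why keeping it in memory while streaming the matrix from disk yields purely sequential I/O.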
Streaming graph analysis

In many cases, graphs are dynamic: vertices and edges are created and deleted, and their attributes may also change constantly. To support streaming graph analysis, FlashGraph needs to process continuous updates on dynamic graphs and generate timely computation results. Ideally, the computation results reflect new updates within a very short time window. A typical method for streaming graph analysis is to generate snapshots of a dynamic graph and run graph algorithms on the snapshots, as demonstrated by Kineograph [15]. This solution is effective and naturally supports many graph applications, such as scan statistics on streaming graphs [16]. The size of a snapshot is usually limited by hardware resources such as RAM, and is affected by the requirements of the application. An SSD-based solution allows much larger snapshots and also many more of them. Supporting snapshots requires FlashGraph to adapt its programming interface to support a variety of streaming graph algorithms. It also requires dynamic data management. Even though FlashGraph is able to increase the size and number of buffered snapshots, it may still not be able to keep up with the rate of incoming updates. In the streaming setting, many algorithms sample the incoming updates on a dynamic graph and compute approximations, to cope with limited storage capacity and computational power. The streaming version of FlashGraph will also support these streaming graph algorithms, so that it can handle heavy streams of incoming updates.

Distributed memory

The SSD-based graph analysis framework alleviates the problem of limited memory size for graph analysis. However, we still need more machines to provide greater computational power and larger storage to solve larger problems or run more computationally expensive applications. Like disk I/O, network technology has advanced significantly. Today, 40 Gbps networks are becoming realistic.
The extreme network speed offers the opportunity to scale graph analysis linearly, but it also imposes challenges in system design. Many strategies for optimizing disk I/O can also be applied to network I/O. For example, we need to maximize the network throughput while minimizing the CPU overhead of network transmission. Furthermore, we should take advantage of the memory cache and maximize its use. Unlike disk I/O, network I/O has arbitrary sizes, which makes memory caching work very differently than for disk I/O. Despite fast network I/O, accessing data on a remote machine still has much higher overhead than accessing the remote memory of a NUMA machine. Therefore, localizing data access with the asynchronous user-task I/O interface of SAFS will bring even more performance improvement in distributed memory than in a NUMA machine. It will be interesting to investigate all of these issues when scaling FlashGraph up to multiple machines.

Sparse matrix/tensor library

Since a sparse matrix and a graph have the same data representation, we can implement basic sparse matrix operations on FlashGraph and build a sparse matrix/tensor library with them. Sparse matrix dense-vector multiplication is key to implementing many linear algebra algorithms, such as computing eigenvalues with the Lanczos method [12] and tensor decomposition [17]. In FlashGraph, the sparse matrix is stored on disk and the vector is stored in memory. Therefore, sparse matrix dense-vector multiplication generates only sequential I/O access to the disks. This even allows us to replace an SSD array with a large magnetic hard drive array to run matrix operations and achieve extreme scalability. This work will result in very efficient external-memory sparse matrix dense-vector multiplication and dense matrix operations, as well as their important applications such as non-negative matrix factorization [18] and tensor decomposition.

Related work

There have been efforts to build graph analysis frameworks on top of more general data analysis frameworks. For example, PEGASUS [19] is a popular graph processing engine built on Hadoop [20], an open-source implementation of MapReduce [21]. PEGASUS respects the nature of the MapReduce programming paradigm and expresses graph algorithms as a generalized form of sparse matrix-vector multiplication. This form of computation works relatively well for graph algorithms such as PageRank and label propagation, but performs poorly for graph traversal algorithms. GraphX [22] is an attempt to bring graph analysis to Spark [23], a general distributed in-memory data analysis framework. It unifies the graph analysis pipeline, integrating graph construction, graph modification and graph computation in a single framework. A graph analysis framework built on top of a general data analysis framework is usually less efficient than a dedicated one.

There have been many research efforts to build distributed graph processing frameworks for large-scale graph analysis. Pregel [5] is a distributed graph-processing framework that allows users to express graph algorithms in vertex-centric programs using bulk-synchronous processing (BSP). It abstracts away the complexity of programming in a distributed-memory environment and runs users' code in parallel in a cluster.
Giraph [24] is an open-source implementation of the Pregel programming model. Many distributed graph engines adopt the vertex-centric programming model and explore different designs to improve performance. GraphLab [25] and PowerGraph [6] prefer shared memory to message passing and provide asynchronous execution. Trinity [26] optimizes message passing by restricting vertex communication to a vertex and its direct neighbors.

Several external-memory graph processing frameworks have been designed to perform large-scale graph analysis on a single machine. GraphChi [7] and X-Stream [8] are specifically designed for magnetic disks. They eliminate random data access to disks by scanning the entire graph dataset in each iteration. Like graph processing frameworks built on top of MapReduce, they work relatively well for graph algorithms that require computation on all vertices, but they share the same limitation: suboptimal performance for graph traversal algorithms. TurboGraph [27] is an external-memory graph engine optimized for SSDs. Like FlashGraph, it reads vertices selectively and fully overlaps I/O and computation. However, it uses much larger I/O requests to read vertices selectively from SSDs, due to its external-memory data representation. Furthermore, it targets graph analysis on a single SSD or a small SSD array, and it does not aim at achieving performance comparable to in-memory graph engines.
Galois [28] differentiates itself from the others by providing a low-level abstraction for constructing graph engines. The core of the Galois framework is its novel task scheduler. These concepts are orthogonal to FlashGraph's I/O optimizations and could be adopted.

Several other works [29], [30] perform graph analysis linear-algebraically, using sparse adjacency matrices and vertex state vectors as the data representations. In this abstraction, PageRank and label propagation are efficiently expressed as sparse-matrix dense-vector multiplication, and breadth-first search as sparse-matrix sparse-vector multiplication. These frameworks target mathematicians and those able to formulate and express their problems in the form of linear algebra. FlashGraph allows a graph algorithm to be expressed both in a customized way and with linear algebra.

Some works, such as the Boost Graph Library [31], the Parallel Boost Graph Library [32] and igraph [10], provide a large collection of pre-written graph algorithms. Although users benefit from these libraries by invoking existing implementations through their APIs, the libraries lack a runtime environment within which to express natively parallel algorithms that scale to very large graphs.

Several graph engines have been designed for temporal graph analysis and streaming graph analysis. Kineograph [15] is a distributed in-memory graph processing framework for streaming graph analysis. It takes graph update streams, constructs graph snapshots on the fly, and runs graph analysis algorithms on the snapshots to simplify the streaming graph algorithms. Chronos [33] is a graph engine optimized for temporal graph analysis. It exploits the time locality in temporal graphs to achieve significant speedups. The streaming version of FlashGraph will adopt similar ideas to optimize for streaming graph analysis.
Conclusion

Given the latest advances in disk I/O and network I/O, we have the opportunity to solve large data analysis problems with efficiency approaching that of in-memory data analysis, but at a much larger scale. We constructed two components: SAFS and FlashGraph. SAFS maximizes the I/O throughput of a large SSD array in a large parallel machine while providing an asynchronous user-task I/O interface for general data analysis. FlashGraph is a semi-external memory graph engine that runs on top of SAFS to perform general graph analysis and sparse matrix operations. We demonstrate that an SSD-based graph engine can run at a speed comparable to its in-memory counterparts. This achievement enables us to process massive datasets efficiently in a single machine without much concern for the limitations of memory size. It also enables us to build cheaper and more energy-efficient machines without compromising performance.

References

[1] D. Zheng, R. Burns, A. S. Szalay. Toward millions of file system IOPS on low-cost, commodity hardware. In International Conference for High Performance Computing, Networking, Storage and Analysis (2013).
[2] D. Zheng, D. Mhembere, R. Burns, A. S. Szalay. FlashGraph: processing billion-node graphs on an array of commodity SSDs. arXiv preprint.
[3] National Research Council. Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC.
[4] D. Zheng, R. Burns, A. S. Szalay. A parallel page cache: IOPS and caching for multicore systems. In the 4th USENIX Conference on Hot Topics in Storage and File Systems (2012).
[5] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski. Pregel: a system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data (2010).
[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In the 10th USENIX Conference on Operating Systems Design and Implementation (2012).
[7] A. Kyrola, G. Blelloch, C. Guestrin. GraphChi: large-scale graph computation on just a PC. In the 10th USENIX Conference on Operating Systems Design and Implementation (2012).
[8] A. Roy, I. Mihailovic, W. Zwaenepoel. X-Stream: edge-centric graph processing using streaming partitions. In the 24th ACM Symposium on Operating Systems Principles (2013).
[9] C. E. Priebe, J. M. Conroy, D. J. Marchette, Y. Park. Scan statistics on Enron graphs. Computational and Mathematical Organization Theory, 11 (2005).
[10] G. Csardi, T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems (2006).
[11] S. Hong, N. C. Rodia, K. Olukotun. On fast parallel detection of strongly connected components (SCC) in small-world graphs. In the International Conference on High Performance Computing, Networking, Storage and Analysis (2013).
[12] D. Calvetti, L. Reichel, D. C. Sorensen. An implicitly restarted Lanczos method for large symmetric eigenvalue problems. ETNA, 2 (1994).
[13] H. Kwak, C. Lee, H. Park, S. Moon. What is Twitter, a social network or a news media? In the 19th International Conference on World Wide Web (2010).
[14] Web graph.
[15] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In ACM EuroSys (2012).
[16] H. Wang, M. Tang, Y. Park, C. E. Priebe. Locality statistics for anomaly detection in time series of graphs. IEEE Transactions on Signal Processing, 62 (2014).
[17] J. H. Choi, S. V. N. Vishwanathan. DFacTo: distributed factorization of tensors. arXiv preprint.
[18] D. D. Lee, H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS (2000).
[19] U. Kang, C. E. Tsourakakis, C. Faloutsos. PEGASUS: a peta-scale graph mining system: implementation and observations. In the 2009 Ninth IEEE International Conference on Data Mining (2009).
[20] Hadoop.
[21] J. Dean, S. Ghemawat. MapReduce: simplified data processing on large clusters. In the 6th Conference on Symposium on Operating Systems Design & Implementation (2004).
[22] R. Xin, J. Gonzalez, M. Franklin, I. Stoica. GraphX: a resilient distributed graph system on Spark. In GRADES (SIGMOD workshop) (2013).
[23] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI (2012).
[24] Giraph.
[25] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. (2012).
[26] B. Shao, H. Wang, Y. Li. Trinity: a distributed graph engine on a memory cloud. In the 2013 ACM SIGMOD International Conference on Management of Data (2013).
[27] W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, H. Yu. TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[28] D. Nguyen, A. Lenharth, K. Pingali. A lightweight infrastructure for graph analytics. In the 24th ACM Symposium on Operating Systems Principles (2013).
[29] J. Kepner, J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Society for Industrial & Applied Mathematics.
[30] A. Lugowski, D. Alber, A. Buluç, J. Gilbert, S. Reinhardt, Y. Teng, A. Waranis. A flexible open-source toolbox for scalable complex graph analysis. In SIAM Conference on Data Mining (SDM) (2012).
[31] Boost Graph Library.
[32] D. Gregor, A. Lumsdaine. The parallel BGL: a generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing (POOSC) (2005).
[33] W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys (2014).