JBösen: A Java Bounded Asynchronous Key-Value Store for Big ML Yihua Fang
JBösen: A Java Bounded Asynchronous Key-Value Store for Big ML
Yihua Fang
CMU-CS
July 21st, 2015
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA
Thesis Committee: Eric P. Xing (Chair), Kayvon Fatahalian
Submitted in partial fulfillment of the requirements for the degree of Master of Science.
Copyright © 2015 Yihua Fang
Keywords: Distributed Systems, Machine Learning, Parameter Server, Stale Synchronous Parallel Consistency Model, Big Data, Big Model, Big ML, Data Parallelism
Abstract
To use distributed systems effectively in Machine Learning (ML) applications, practitioners are required to possess a considerable amount of expertise in the area. Although highly abstracted distributed frameworks such as Hadoop can reduce the complexity of writing distributed code, their performance is not comparable to that of specialized implementations. Other efforts such as Spark and GraphLab each have their own downsides. In light of this observation, the Petuum project is intended to provide a new framework for implementing highly efficient distributed ML applications through a high-level programming interface. JBösen is a Java key-value store system in Petuum that follows the Parameter Server (PS) paradigm; it aims to extend the Petuum project to Java, as almost a quarter of the programmer population uses Java. It provides an iterative-convergent programming model that covers a wide range of ML algorithms and an easy-to-use programming interface. JBösen, unlike other platforms, exploits the error-tolerance property of ML algorithms to improve performance, using a Stale Synchronous Parallel (SSP) consistency model to relax the overall consistency of the system by allowing workers to read older, staler values.
Contents
1 Introduction
2 Key Ideas Overview
  2.1 Iterative Convergent ML Programming Model
  2.2 Parameter Server
  2.3 Stale Synchronous Parallel Consistency Model
  2.4 SSP Push Consistency Model
3 JBösen System Design
  3.1 Table Interface
  3.2 Clients
    3.2.1 Application Threads
    3.2.2 Background Threads
  3.3 Consistency Controllers
    3.3.1 Ensuring SSP Consistency
    3.3.2 SSP Push Consistency
  3.4 Server
    3.4.1 Row Requests
    3.4.2 Push Rows
    3.4.3 Name Node
  3.5 Network
  3.6 Integration with YARN
4 JBösen Programming Interface and APIs
  4.1 Complex APIs
    4.1.1 Initialize PsTableGroup
    4.1.2 Creating Tables
    4.1.3 Spawning Application Threads
    4.1.4 Shutdown JBösen
  4.2 PS Application
    4.2.1 Initialize
    4.2.2 runWorkerThread
  4.3 Application: Matrix Factorization
  4.4 The Row Interface and the User-Defined Row Type
5 Performances
6 Conclusion and Future Work
Appendices
A Complete Row Interface
  A.1 Row Update
  A.2 Row Create Factory
B Complete PsApplication Interface
C Matrix Factorization Implemented Using JBösen APIs
Bibliography

List of Figures
2.1 Data Parallelism
2.2 Demonstration of the SSP consistency. Worker threads cannot be more than s iterations apart, where s is user defined
3.1 JBösen overall architecture
3.2 Client Architecture
5.1 Left: JBösen scaling performance using the MF application. Right: JBösen speedup using the MF application
Chapter 1 Introduction
In recent years, Machine Learning (ML) practitioners have been turning to distributed systems to satisfy the growing demand for computational power in ML applications. As we advance into an age of Big Data (terabytes to petabytes of data) and Big Models (well beyond billions of parameters), a single machine can no longer solve ML problems in an efficient and timely manner. However, implementing ML algorithms on top of a distributed system has proven to be no easy task. One needs to possess a considerable amount of knowledge in both distributed systems and ML in order to write an efficient piece of distributed ML application code, and today only a small percentage of ML practitioners fit this description. The apparent solution is an abstracted distributed platform that hides the difficult systems work and allows users to focus on ML algorithms. In the past, there have been efforts to build highly abstract general-purpose distributed frameworks for ML, but these existing systems present a variety of trade-offs in efficiency, correctness, programmability, and generality [9]. Hadoop [8] is a popular example of such a platform with an easy-to-use MapReduce programming model. Yet the MapReduce programming model makes it difficult to optimize using many properties that are specific to ML algorithms, such as their iterative nature, and its performance is not competitive with other existing systems. Spark [10], while retaining the scalability and fault tolerance of MapReduce, is an alternative platform to Hadoop that is optimized for iterative ML algorithms. What Spark does not offer is the fine-grained scheduling of computation and communication which is considered essential for a fast and correct implementation of distributed ML algorithms [1]. GraphLab [4] and other graph-oriented platforms distribute the workload through clever partitioning of graph-based models, but many advanced ML algorithms, such as topic modeling, cannot be easily or efficiently represented as graphs. The remaining category of systems [3, 6] offers powerful and versatile low-level programming interfaces but does not offer easy-to-use building blocks for implementing many ML algorithms. The Petuum project [9] aims to address this issue by providing a new framework for implementing highly efficient distributed machine learning applications through a high-level programming interface. The framework proposes that a wide spectrum of ML algorithms can be represented by an iterative-convergent [9] programming model, treating the ML algorithm as a series of iterative updates to the ML model. In this way, the system exploits two types of parallelism: (1) data parallelism, where the training data is partitioned and distributed across a cluster of machines; (2) model partitioning, where models are partitioned, assigned to different machines, and updated in parallel.
The system also exploits a statistical property that is specific to ML algorithms: error tolerance, meaning that ML algorithms can tolerate a limited amount of inconsistency in intermediate calculations. It takes advantage of the error-tolerance property through a relaxed consistency model, the Stale Synchronous Parallel (SSP) consistency model [7], and its variant, the SSP Push consistency model. [7] shows that the SSP and SSP Push consistency models may significantly improve the throughput of the system. The key component in the Petuum project that realizes the designs and optimizations described above is a parameter server [3] system, named Bösen, for accessing the shared model parameters from any machine through a key-value storage API that resembles single-machine programming interfaces. What is missing from the Petuum project, however, is a Java PS system, as Bösen is implemented in C++. Based on figures from a variety of unofficial sources collected in [5], it is estimated that there are between eight and ten million Java programmers in the world today. In addition, there are far more ML researchers and data scientists who are familiar with Java than with C++, which is required to use Bösen. In light of this observation, we present in this thesis a Java version of the PS system, JBösen (1), for the Petuum project. While we incorporated all of the above-mentioned design points from Bösen into JBösen, we implemented JBösen independently of Bösen, as a standalone system. We begin with a detailed theoretical explanation of the design points JBösen incorporates. We then illustrate how the different components of JBösen come together to realize the aforementioned design ideas. Following that, we demonstrate how an ML algorithm can be implemented using JBösen's interfaces as an iterative-convergent programming model. In this thesis, we use a matrix factorization algorithm based on Stochastic Gradient Descent (SGD) as the demonstration, but the Petuum ML library also includes more algorithms on JBösen not explored in this paper, and more are planned in the future. (1) JBösen is available as part of the Petuum project, open sourced on
Chapter 2 Key Ideas Overview
We begin with a theoretical discussion of the key ideas used in the JBösen system, which differentiate JBösen from other distributed ML frameworks. These ideas are manifested as the fundamental features of JBösen, and each exploits some unique property of ML algorithms. We start by illustrating how the iterative-convergent programming model represents a wide range of ML algorithms with data parallelism. Then we discuss how the parameter server architecture is suitable for implementing data parallelism. Following that, we explain how the SSP and SSP Push consistency models can exploit the error-tolerance nature of ML algorithms to improve the throughput of the system.
2.1 Iterative Convergent ML Programming Model
[9] formalized the iterative-convergent ML programming model as executing an update function iteratively until the model state satisfies some stopping criterion. Mathematically, it can be expressed as $A^{(t)} = F(A^{(t-1)}, \Delta(A^{(t-1)}, D))$, where $t$ denotes the iteration, $A$ denotes the model state, and $D$ is the training data. The update function $\Delta(\cdot)$, which improves a loss function, performs computation on the training data $D$ and the model state $A$, and the results are aggregated by some function $F$. In the context of data parallelism, the iterative-convergent programming model becomes $A^{(t)} = F\bigl(A^{(t-1)}, \sum_{p=1}^{P} \Delta(A^{(t-1)}, D_p)\bigr)$, where $P$ is the number of partitions, or equivalently the number of workers. The training data is partitioned and assigned to the workers, and the intermediate results computed by each worker are aggregated before being applied to the model parameters. The important observation is the property of additive updates: the updates $\Delta(\cdot)$ are associative and commutative and can be aggregated through summation, both locally and over the network.
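To make the additive-update structure concrete, the following is a minimal Java sketch of one data-parallel iteration: each worker computes a delta on its own partition, the deltas are summed, and the sum is applied to the shared model. The class and method names are illustrative only and are not part of JBösen.

import java.util.Arrays;
import java.util.List;

public class DataParallelSketch {
    // One iteration of A_t = F(A_{t-1}, sum_p Delta(A_{t-1}, D_p)).
    static double[] iterate(double[] model, List<double[][]> partitions) {
        double[] aggregate = new double[model.length];
        // Each worker computes an additive update on its own data partition.
        // (Run sequentially here; in JBösen these run on different machines.)
        for (double[][] partition : partitions) {
            double[] delta = computeDelta(model, partition); // hypothetical update fn
            for (int k = 0; k < aggregate.length; k++) {
                aggregate[k] += delta[k]; // summation is valid: updates are additive
            }
        }
        // F: apply the aggregated update to the previous model state.
        double[] next = Arrays.copyOf(model, model.length);
        for (int k = 0; k < next.length; k++) {
            next[k] += aggregate[k];
        }
        return next;
    }

    // Placeholder for the algorithm-specific update (e.g., a gradient step).
    static double[] computeDelta(double[] model, double[][] data) {
        return new double[model.length];
    }
}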
Figure 2.1: Data Parallelism
This property also allows that, no matter which partition of the training data is assigned to a particular worker, the worker can in a statistical sense contribute equally to the convergence of the ML algorithm via $\Delta(\cdot)$ [9]. This is commonly assumed in many ML applications.
2.2 Parameter Server
The primary challenge of the data-parallel approach is designing a software system that shares a gigantic ML model across all nodes in a cluster with correctness, performance, and an easy-to-program interface at the same time. The parameter server is a popular paradigm that ML practitioners have been turning to in recent years [1]. Without being limited to a specific implementation, the concept of the parameter server is a shared key-value storage with a centralized storage model and an easy-to-use programming model. It is intended as a software system for sharing the model parameters of ML applications across a cluster of commodity machines. The nodes are partitioned into two classes. The client nodes perform the bulk of the computation for the ML algorithm; usually, each client node retains a portion of the training data and performs statistical computation, such as Stochastic Gradient Descent (SGD), to train the shared model. The server nodes are responsible for synchronizing the parameters of the shared model to provide a global view of the shared ML model to all clients. The model is partitioned and assigned to different servers, so updates to the model can be processed in parallel. The clients communicate with the server nodes to retrieve or update the shared parameter values.
2.3 Stale Synchronous Parallel Consistency Model
ML applications possess iterative-convergence characteristics in which the algorithm has a limited error tolerance in the parameter state. In other words, the computation at each iteration need not use the most up-to-date parameters. This presents a unique opportunity for optimization: by relaxing the consistency model of the software system, higher throughput can be achieved compared to a strong consistency model.
The SSP consistency model is based on the observation that the ML parameters are allowed to be stale, and it aims to maximize the time each worker spends on useful computation (versus communication with the server) while still providing correctness guarantees for the algorithm.
Figure 2.2: Demonstration of the SSP consistency. Worker threads cannot be more than s iterations apart, where s is user defined.
In SSP, each worker makes an update $\varepsilon_i$ to a parameter, where the update is associative and commutative. Hence, the freshest version of the parameter value $\delta$ is the sum of the updates from all workers on $\delta$: $\delta = \delta' + \sum_i \varepsilon_i$, where $\delta'$ denotes the parameter value from the last iteration. When a worker asks for $\delta$, the SSP consistency model allows the server to present to the client a stale version of $\delta$ that may or may not include all of the updates, up to a limit. As in the example in Figure 2.2, the limit on the staleness of the parameters can be user defined as a staleness threshold. In this way, we bound the system's staleness by the staleness threshold. Informally, if c is the iteration of the fastest client in the system, it may not make further progress until all other clients have progressed to at least iteration c − s − 1, where s is the user-defined threshold. By using SSP, we relax the consistency of the entire system: the clients can perform more computation instead of spending time waiting for fresh parameters from the servers. [1] presents a formal proof of the correctness of the SSP consistency model.
2.4 SSP Push Consistency Model
The SSP Push consistency model is a variant of the SSP consistency model. In SSP, whenever a client needs a fresher parameter, it requests the parameter from the servers and is forced to wait for the response. In SSP Push, the servers, instead of waiting for the clients to request parameter values, try to push the updated parameters to the clients. If a client has requested a parameter before, the server pushes that parameter to the client after collecting all updates from the other clients. This takes advantage of the fact that if a client accesses a parameter, it is likely to access it again in future iterations. In this way, we reduce the time clients spend waiting for requests to come back. The other benefit of SSP Push is an improved convergence rate for ML algorithms, because the average staleness of the parameters used in computation is significantly reduced [1].
Chapter 3 JBösen System Design
The aim of the JBösen system is a distributed ML platform that allows ML practitioners to exploit unique properties of ML algorithms and to write data-parallel ML programs in Java that scale to Big Data and Big Models.
Figure 3.1: JBösen overall architecture
It offers users a table-like interface with key-value storage APIs for ease of programming. It follows the parameter server paradigm, which partitions the system into two classes of nodes, clients and servers. The clients are responsible for the ML computation, and the servers are responsible for synchronizing and sharing model parameters across the cluster. By adopting a consistency controller, the system implements the SSP consistency model and its variant, SSP Push, and frees users from explicit network synchronization. To further the usability of the JBösen system, we integrated it with the YARN/Hadoop environment. YARN is a resource manager and scheduler in the Hadoop environment and has become quite popular on large-scale clusters; therefore, from a usability standpoint, it is useful to integrate JBösen with YARN.
3.1 Table Interface
JBösen provides a table-like interface for the shared parameters. Each shared parameter in an ML model is addressed by a row ID and a column ID. Rows are the minimum unit of communication between the servers and the clients. The rows are implemented in two different types: dense and sparse. A dense row is implemented using an array, where every column of the row is present in memory. A sparse row is implemented as a map, where only the non-zero entries exist in memory. The row types also come with different value sizes, such as Integer vs. Long and Float vs. Double. This differentiation is designed to give users the flexibility to tailor rows to their own application needs; the row interface is explained further in Section 4.4, and a simplified illustration of the two representations follows below. In the table interface, each application thread on a client is regarded as a worker by the system.
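The dense/sparse distinction can be pictured with the following minimal sketch; these classes are illustrative only and are not JBösen's actual row implementations.

import java.util.HashMap;
import java.util.Map;

// Dense: every column is materialized in a fixed-length array.
class DenseDoubleRowSketch {
    private final double[] values;
    DenseDoubleRowSketch(int capacity) { values = new double[capacity]; }
    double get(int columnId) { return values[columnId]; }
    void inc(int columnId, double delta) { values[columnId] += delta; }
}

// Sparse: only non-zero columns are stored, keyed by column ID.
class SparseDoubleRowSketch {
    private final Map<Integer, Double> values = new HashMap<>();
    double get(int columnId) { return values.getOrDefault(columnId, 0.0); }
    void inc(int columnId, double delta) { values.merge(columnId, delta, Double::sum); }
}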
The application threads execute the computation and access the shared parameters stored in the table through a key-value store interface via GET and INC. In the context of JBösen, GET obtains a row while INC updates the values of a row. Once an application thread completes an iteration, it notifies the system via CLOCK, which increments the application thread's clock by 1. The clock is used to keep track of the progress of the system, where clock c represents iteration c. The SSP consistency model guarantees that a read request generated at clock c will observe all updates from clock 0 up to clock c − s − 1, where s is the staleness threshold.
3.2 Clients
The clients consist of four components: application threads, background threads (JBösen system threads on the client), the consistency controller, and the process cache storage.
Figure 3.2: Client Architecture
The application threads contain the ML algorithm logic, which generates requests to the background threads through the JBösen interface. The background threads are responsible for communication between clients and servers and between application threads, for maintaining the process cache storage using data from the servers, and for bookkeeping with the servers. The background threads use a consistency controller to control the staleness of the data based on SSP and to free the application threads from synchronizing explicitly.
3.2.1 Application Threads
The application threads are the user-implemented application logic together with the data partitions. An application thread first loads the training data from memory, from disk, or over a distributed file system such as HDFS or NFS. Data loading is intentionally left to the application programmers, since Java has native support for most storage systems. The application threads can then touch the data in any order desired by the application programmer, based on the implemented algorithm; the application can choose to sample one data point at a time or to pass through all data points in one iteration. At the end of an iteration, updates are pushed to the servers and the parameters are synchronized with the servers.
3.2.2 Background Threads
The background threads are the system threads on the client nodes, and they have three functions. The first function is to handle the communication within clients and between clients and servers. The application threads communicate with the background threads through messages over intra-process channels, where each message represents an instruction to the background threads. Upon receiving an instruction, the background threads send the appropriate network messages to the servers.
The response from the server arrives at the background threads first; the background threads handle the message, perform any bookkeeping, and then respond to the application threads through the intra-process channels. The second function of the background threads is to handle the network synchronization with the servers. Note that the JBösen API only offers GET and INC methods; the network synchronization with the servers is done in the background threads, abstracted away from the users, and happens each time the system clock advances. The system clock advances when all application threads that the background thread is responsible for have advanced to a new clock. In other words, the system clock is defined as the minimum clock over a group of application threads and advances when that minimum increases by 1. Each time the system clock advances, the background threads send the updates for that clock, along with the clock information, to the servers. The third function of the background threads is to maintain a process-level cache for all the application threads on a node and to maintain the staleness of the data based on the SSP consistency model. Whenever servers send parameter data to the clients, the background threads update the process cache storage; this can happen both when clients request rows from the servers and when the servers push row data to the clients. When an application thread requests rows, the background thread uses the consistency controller to check first, and only requests rows from the server if the cached rows are too stale under SSP consistency.
3.3 Consistency Controllers
JBösen implements both the SSP and SSP Push consistency models through a consistency controller. The main purpose of the consistency controller is to manage the data supplied to the application threads' GET requests so that it complies with the staleness requirements of the SSP consistency model.
3.3.1 Ensuring SSP consistency
The clients in the JBösen system cache previously accessed rows in the process cache storage on each client. When an application thread issues a GET request, it first checks the local cache for the requested row; if the row is not found in the local cache, a request is sent to the server to fetch it. In addition, each row in the process cache storage is associated with a clock: a row with clock c means that the updates generated by all workers before clock c have been applied. If the row has clock c and c ≥ c_p − s, where c_p is the worker's clock and s is the staleness threshold, then the row is considered fresh and is returned to the worker. Otherwise, the row is considered too stale, and the worker sends a request to the server and blocks until the server sends a fresh row.
3.3.2 SSP Push consistency
After each advance of the system clock, the updates are sent to the server. Since the updates are associative and commutative, the order in which they are applied to the rows is not a concern.
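A minimal sketch of the freshness check described above; the names are illustrative, and the rule shown (row clock c ≥ c_p − s) follows the description in this section rather than JBösen's actual source.

// Illustrative consistency-controller check for a GET at worker clock workerClock.
// cachedRowClock is the clock attached to the row in the process cache storage,
// or -1 if the row is not cached. Returns true if the cached copy may be served.
static boolean isFreshEnough(int cachedRowClock, int workerClock, int staleness) {
    if (cachedRowClock < 0) {
        return false; // not cached: must request the row from the server
    }
    // Fresh if all updates up to workerClock - staleness - 1 are already applied,
    // i.e. the row clock is at least workerClock - staleness.
    return cachedRowClock >= workerClock - staleness;
}

When this check returns false, the background thread issues a row request and the worker blocks until a fresh row arrives, as described in Section 3.3.1.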
The servers then push the updated parameters through callbacks. When a client requests a row for the first time, it registers a callback on the server. As the server clock advances upon collecting clock ticks from all workers, the server pushes the rows to the registered clients. This is different from SSP consistency, where the server passively sends out updates as the clients request them, which happens each time a client does not have a fresh value in its process cache storage. This mechanism exploits the fact that applications often revisit the same rows in iterative-convergent algorithms.
3.4 Server
The servers are a centralized key-value storage for sharing Big ML Models through the table interface. We designed the servers to partition the tables in a simple hash-ring fashion, where rows are sharded across the server machines. The sharding is based on hashing the key, so each server maintains a portion of the shared tables. There are two primary methods in the server interface: row requests and push rows. To allow synchronization in application logic, we implemented a lightweight name node, running on one of the server nodes, that provides a globalBarrier() interface for applications. globalBarrier() allows the application threads to synchronize globally to the same state. The name node is also used to synchronize the initialization phase of both the clients and the servers.
3.4.1 Row Requests
A client sends a row request to the server if there is no fresh enough row in its local cache based on SSP consistency. The server, upon receiving the row request, examines whether it has a fresh enough version of the row; if so, the server sends the row to the client immediately. If the server does not have a fresh enough row, this indicates that some other clients have not yet progressed to the required server clock, where the server clock is defined as the minimum clock over all clients. For example, if client_1 requests a row at clock c_1 and another client, client_2, is currently running at clock c_1 − s − 1, where s is the staleness threshold, the request from client_1 cannot be satisfied, because not all clients have contributed their updates to the requested row through iteration c_1 − s yet. The request is stored at the server, and the requesting client blocks, waiting for the other clients to catch up.
3.4.2 Push Rows
At the end of each system clock, the client sends the server the updates for that iteration. The server applies the updates to the rows and maintains the iteration information. When all clients have advanced at least one clock, the server clock is said to have advanced. Each time the server clock advances, the server pushes the current rows to a client if the client has requested those rows or the client had updates for those rows in the last iteration.
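The following sketch illustrates the server-side decision described in Sections 3.4.1 and 3.4.2: serve a row request immediately if the server clock is fresh enough, otherwise queue it until the server clock catches up. Class and field names are hypothetical and do not reflect JBösen's actual server code.

import java.util.ArrayDeque;
import java.util.Queue;

class RowRequest {
    final int clientId;
    final int rowId;
    final int requestClock; // the requesting worker's clock
    RowRequest(int clientId, int rowId, int requestClock) {
        this.clientId = clientId;
        this.rowId = rowId;
        this.requestClock = requestClock;
    }
}

class ServerShardSketch {
    private int serverClock = 0; // min clock over all clients seen so far
    private final Queue<RowRequest> pending = new ArrayDeque<>();

    void onRowRequest(RowRequest req, int staleness) {
        // Fresh enough: every client has contributed updates through
        // clock requestClock - staleness, so the row can be served now.
        if (serverClock >= req.requestClock - staleness) {
            sendRow(req.clientId, req.rowId);
        } else {
            pending.add(req); // the requesting client blocks until we catch up
        }
    }

    void onServerClockAdvance(int newServerClock, int staleness) {
        serverClock = newServerClock;
        // Re-check queued requests; this is also the point where SSP Push
        // would push updated rows to clients that registered callbacks.
        for (int n = pending.size(); n > 0; n--) {
            RowRequest req = pending.poll();
            onRowRequest(req, staleness);
        }
    }

    private void sendRow(int clientId, int rowId) { /* network send omitted */ }
}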
3.4.3 Name Node
To synchronize state globally in the JBösen system, we implemented a lightweight name node as a central server independent of the table servers. At the beginning, the name node is used to synchronize the initialization of all the components of the JBösen system: each background thread and each server thread registers with the name node using a simple handshake, and the name node then synchronizes the table creation process and signals all components that the system is ready to start. The global barrier is implemented with the name node. The globalBarrier() method sends a global barrier message to the name node and waits for the name node's reply; the name node waits for the same message from all clients before sending its reply to all of them. Since the name node is quite lightweight, it is reasonable to choose a centralized architecture even though it is a single point of failure. In addition, the name node runs on one of the server nodes, which does not affect performance because the name node does so little work.
3.5 Network
The network communication of JBösen is implemented using the Netty project and blocking queues. Netty is a NIO framework that allows easy implementation of network applications; it is asynchronous and event-driven, which is suitable for highly scaled applications such as JBösen. In the JBösen data path, the application threads first communicate with the background threads, and the background threads communicate with the servers. The relationship between application threads and background threads is many-to-many. Following the PS architecture, each client is connected to all the servers and maintains the connections throughout the lifetime of the application. The communication between the background threads and the application threads on a client is done through blocking queues, where each message is queued directly onto the destination application thread's queue. For communication between the server and the client, the Netty threads automatically receive the message and queue it onto the destination's queue. Each message communicated over the network has a header associated with it. The reason we need to encode the size of the message in the header is that Netty may fragment the message, and the receiver needs to reassemble the fragments before decoding the message (a sketch of such a length-prefixed pipeline appears at the end of this chapter).
3.6 Integration with YARN
JBösen is integrated into the YARN environment by writing a YARN application. To run on YARN, we need a YARN client and an application master. The YARN client is responsible for setting up the YARN application master and for copying all local files onto HDFS if users do not copy them manually. The node running the application master is logically considered to be separate from the local machine running the client. The application master is responsible for setting up the containers and distributing the necessary files.
The training data used by the application is usually so large that users should move the files to HDFS manually; it is not reasonable to copy the data files every time the system runs. The YARN client is much simpler than the application master, as it consists mostly of boilerplate code from YARN's user manual. One extra task is to use the HDFS APIs to copy files from the local file system to HDFS, since a program running on YARN can only be assumed to have access to HDFS. Note that, differing from most YARN applications, we chose to make the client fail fast in case the application master fails, instead of resubmitting the application master. The application master does a few extra things on top of the boilerplate code. First, since the JBösen system needs a host file containing the IP addresses of the cluster, the application master requests the containers and generates a host file based on the containers obtained. Second, the application master needs to distribute the JBösen jar file onto the containers. YARN has a local-resource interface that automatically distributes files onto the containers, which requires careful setup.
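As mentioned in Section 3.5, every message carries a length header so the receiver can reassemble Netty's fragments. Below is a minimal sketch of such a pipeline using Netty's stock length-field handlers; it is illustrative only and is not JBösen's actual network code.

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.LengthFieldBasedFrameDecoder;
import io.netty.handler.codec.LengthFieldPrepender;

public class FramingInitializerSketch extends ChannelInitializer<SocketChannel> {
    private static final int MAX_FRAME_BYTES = 64 * 1024 * 1024;

    @Override
    protected void initChannel(SocketChannel ch) {
        // Outbound: prepend a 4-byte length field to every message.
        ch.pipeline().addLast(new LengthFieldPrepender(4));
        // Inbound: wait until a whole frame has arrived, then strip the
        // 4-byte length field and hand the complete message downstream.
        ch.pipeline().addLast(new LengthFieldBasedFrameDecoder(
                MAX_FRAME_BYTES, /*lengthFieldOffset=*/0, /*lengthFieldLength=*/4,
                /*lengthAdjustment=*/0, /*initialBytesToStrip=*/4));
        // Application-specific handlers (serialization, dispatch to the
        // background threads' queues) would be added after these codecs.
    }
}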
Chapter 4 JBösen Programming Interface and APIs
The JBösen system provides three main methods in its programming interface for users to express their iterative-convergent ML algorithms. As explained in Section 3.1, the shared model is represented in JBösen as a table interface with rows and columns. A get() method is used to request rows from the table interface, an inc() method alters the values in the table, and, last but not least, a clock() method indicates the iteration progress of the application logic. Below is a basic JBösen program using the three methods in the interface.

table = PsTableGroup.getTable(tableId);
for (int iter = 0; iter < totalIterations; iter++) {
    // Request row data from the table.
    row = table.get(rowId);
    // Use the row data to perform some computation and
    // compute the changes to the row data.
    change = compute(row);
    // Write the changes to the row data.
    table.inc(rowId, change);
    // At the end of this iteration, use clock to indicate the advance
    // of the iteration.
    PsTableGroup.clock();
}

The tables are stored on the servers and are identified by tableId. The reference to a table is accessed through the PsTableGroup object and is usually obtained after table creation. The table object contains two methods: table.get(rowId), to read the row data specified by rowId, and table.inc(rowId, changes), to write the changes to the row specified by rowId. With these two methods, the JBösen system automatically ensures SSP consistency across all components of JBösen, and no additional attention is required from the users. There are two different sets of APIs for setting up JBösen.
One interface sets up the JBösen system manually, with a number of boilerplate function calls that need to be made in every application, but it gives programmers greater flexibility in writing their applications. The other is a simpler PS Application interface in which much of the boilerplate code for setting up JBösen is automated. Although the PS Application interface is simpler, it imposes a specific programming model for using the APIs.
4.1 Complex APIs
In general, and in order, a JBösen application needs to:
- Initialize the global PsTableGroup instance.
- Use PsTableGroup to create the tables that are needed.
- Spawn worker threads and complete the computation.
- De-initialize the global PsTableGroup instance.
4.1.1 Initialize PsTableGroup
The PsTableGroup.init(PsConfig config) function must be called before any other methods of the PsTableGroup interface are invoked. The PsConfig class contains several public fields that the application must set:
- clientId: The ID of this client, from 0, inclusive, to the number of clients, exclusive. This value must be set.
- hostFile: The absolute path to the host file in the local file system. This value must be set.
- numLocalWorkerThreads: The number of application threads for every client.
- numLocalCommChannels: The number of background threads per client. The recommended value is one per 4 to 8 worker threads.
For example, an application may begin as follows:

public static void main(String[] args) {
    PsConfig config = new PsConfig();
    config.clientId = 0;
    config.hostFile = "/home/username/cluster_machines";
    config.numLocalWorkerThreads = 16;
    config.numLocalCommChannels = 2;
    PsTableGroup.init(config);
    ...
}
4.1.2 Creating tables
After PsTableGroup has been initialized, it is time to create the distributed tables needed by the application. This is done using the createTable method, which takes several parameters:
- int tableId: The identifier for this table, which must be unique. This identifier is used later to access the table.
- int staleness: The SSP staleness parameter, which bounds the asynchronous operation of the table. Briefly, a worker thread that has called PsTableGroup.clock() n times will see all updates by workers that have called PsTableGroup.clock() up to n − staleness times.
- RowFactory rowFactory: A factory object that produces the rows stored by this table. The implementation of the row is up to the application.
- RowUpdateFactory rowUpdateFactory: A factory object that produces the row updates stored by this table.
Essentially, JBösen tables store Row objects, which are updated by applying RowUpdate objects to them. The implementation of the row update is up to the application. Since implementing these rows and row updates can be a lot of work, JBösen provides some common pre-implemented tables. For example, createDenseDoubleTable() creates a table whose rows and row updates are essentially fixed-length arrays of double values, and applying an update means element-wise addition.
4.1.3 Spawning application threads
After all the necessary tables are created, the application can spawn its worker threads. JBösen expects the number of worker threads to be exactly the number set in PsConfig.numLocalWorkerThreads. To inform JBösen that a thread is a worker thread, it must call PsTableGroup.registerWorkerThread() before accessing any tables. After doing so, the worker can use PsTableGroup.getTable(int tableId) to retrieve the previously created tables and begin to use them. When the worker has finished running the application, it must call PsTableGroup.deregisterWorkerThread() to inform JBösen. Each worker thread may look something like the following:

class WorkerThread extends Thread {
    @Override
    public void run() {
        PsTableGroup.registerWorkerThread();
        Table table = PsTableGroup.getTable(0);
        ...
        PsTableGroup.deregisterWorkerThread();
    }
}
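Putting the steps above together, a driver for the complex API might look like the following sketch. It assumes a createDenseDoubleTable overload taking (tableId, staleness, rowCapacity), mirroring the createDenseIntTable call shown in Section 4.2.1; the exact signatures in JBösen may differ. Shutdown is covered in the next subsection.

public class ComplexApiSkeleton {
    public static void main(String[] args) throws InterruptedException {
        // 1. Initialize the global PsTableGroup instance.
        PsConfig config = new PsConfig();
        config.clientId = Integer.parseInt(args[0]);
        config.hostFile = "/home/username/cluster_machines";
        config.numLocalWorkerThreads = 16;
        config.numLocalCommChannels = 2;
        PsTableGroup.init(config);

        // 2. Create the tables needed by the application.
        //    (Signature assumed from the parameters listed in Section 4.1.2.)
        PsTableGroup.createDenseDoubleTable(0, /*staleness=*/2, /*rowCapacity=*/100);

        // 3. Spawn exactly numLocalWorkerThreads worker threads and wait for them.
        Thread[] workers = new Thread[config.numLocalWorkerThreads];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new WorkerThread(); // the WorkerThread class shown above
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }

        // 4. De-initialize the global PsTableGroup instance.
        PsTableGroup.shutdown();
    }
}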
4.1.4 Shutdown JBösen
When all worker threads have been de-registered, the last thing to do is to shut down JBösen. This is done simply by calling PsTableGroup.shutdown().
4.2 PS Application
The motivation behind the PS Application interface is to hide the boilerplate code in order to simplify application implementation. In the previous interface, there are quite a few functions that need to be called in every application: the init() function needs to be called before other methods of the PsTableGroup interface can be invoked, each worker thread needs to register and de-register with the system at the beginning and end of its execution, and programmers need to spawn the application threads themselves instead of letting the system spawn them automatically. It is quite reasonable to assume that most application code will launch more than one application thread due to data parallelism. Ultimately, we try to abstract away the system interfaces and allow ML programmers to focus only on the algorithm implementation. The PS Application interface is simplified so that all of the aforementioned calls are made by the system instead of by the application code. The interface reduces the requirements on users to implementing two functions, initialize() and runWorkerThread(int threadId), in the PsApplication class. The complete PsApplication interface can be found in Appendix B.
4.2.1 Initialize
In initialize(), users are expected to create all the tables needed by the application and to initialize any variables needed for creating the tables. It is also a good time to load the data from memory, local disk, or a distributed storage system. For example, a simple application that creates one table would look like:

public void initialize() {
    int rowCapacity = PsTableGroup.getNumLocalWorkerThreads();
    PsTableGroup.createDenseIntTable(CLOCK_TABLE_ID, staleness, rowCapacity);
}

4.2.2 runWorkerThread
The runWorkerThread(int threadId) function contains the application logic. This assumes the application takes a data-parallel approach, partitioning the training data and computation across the application threads. The input threadId variable is a globally unique thread ID.
Usually, the application will perform computation using the local training data, push updates at the end of an iteration by calling PsTableGroup.clock(), and read off updates from the servers to reflect the updates of the other clients. Here is a simple example that simply adds 1 to the parameters at each iteration:

@Override
public void runWorkerThread(int threadId) {
    ...
    // Get table by ID.
    IntTable clockTable = PsTableGroup.getIntTable(CLOCK_TABLE_ID);
    for (int iter = 0; iter < numIterations; iter++) {
        ...
        // Increment the clock entry for this thread.
        clockTable.inc(clientId, threadId, 1);
        // Finally, finish this clock.
        PsTableGroup.clock();
    }
    // globalBarrier makes subsequent reads fresh.
    PsTableGroup.globalBarrier();
    // Read updates. After globalBarrier, all updates should be visible.
    for (int cliId = 0; cliId < numClients; cliId++) {
        IntRow row = clockTable.get(cliId);
        ...
    }
}

4.3 Application: Matrix Factorization
We have included in Appendix C a matrix factorization implementation using the JBösen PsApplication interface to demonstrate the simplicity of the APIs. This implementation runs SGD on the local training data for a number of iterations.
4.4 The Row Interface and the User-Defined Row Type
The table interface (Section 3.1) used in JBösen chooses rows as the minimum unit of data. In JBösen, we implemented rows for all the primitive types: Integer, Double, Long, and Float, but we also provide a user-defined row interface for those who wish to implement their own row type.
This feature is an extension to the table interface and can greatly expand the programmability of the system. For example, consider a row type where, in addition to the row types offered in the table interface, there is a value ε associated with each row, and entries are automatically set to 0 if their value falls between −ε and ε. Implementing this example without the user-defined row feature could require considerable engineering effort to work around. To implement a row type, four pieces of information are required: a row type with a row create factory, and a row update type with a row update factory, where the factory interfaces are used to create the respective row and row update. As explained in the previous sections, JBösen uses rows and row updates separately, where the value of a row can be considered the aggregate of a series of row updates. The user is therefore required to implement this aggregation through the inc() interface. For example, in all the provided row implementations for the primitive types, the inc() aggregation operation is implemented as summation. Note that, besides aggregation, creation, and serialization/de-serialization, the row interface does not contain any of the other methods discussed earlier in the thesis, such as get(), or methods that might seem fundamental to the type, such as contains() (whether a value is contained in the row). This is a design choice deliberately delegated to the users. Conceptually, from the perspective of the JBösen system, the table interface is a collection of rows with row IDs that can be aggregated, created, and packed for network communication. To create a table, the createTable() interface takes in only a row factory and a row update factory object; the other functionality of the rows is used in the application logic implemented by the users. In this way, users have full control to tailor the row types to their needs. The complete row interface is attached in Appendix A, and a sketch of the ε-thresholded row described above follows below.
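As an illustration of the user-defined row type discussed above, here is a minimal sketch of the ε-thresholded dense double row, written against the Row and RowUpdate interfaces reproduced in Appendix A. It is an example written for this thesis's interfaces, not code shipped with JBösen, and the factory classes are omitted for brevity.

import java.nio.ByteBuffer;

// A dense double row update: a fixed-length array of deltas.
class DenseDoubleUpdateSketch implements RowUpdate {
    final double[] deltas;
    DenseDoubleUpdateSketch(int capacity) { deltas = new double[capacity]; }

    @Override
    public void inc(RowUpdate rowUpdate) {
        // Aggregate updates by element-wise summation.
        double[] other = ((DenseDoubleUpdateSketch) rowUpdate).deltas;
        for (int k = 0; k < deltas.length; k++) deltas[k] += other[k];
    }

    @Override
    public ByteBuffer serialize() {
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES * deltas.length);
        for (double d : deltas) buf.putDouble(d);
        buf.flip();
        return buf;
    }
}

// A dense double row that zeroes out entries whose magnitude drops below epsilon.
class ThresholdedDoubleRowSketch implements Row {
    private final double[] values;
    private final double epsilon;
    ThresholdedDoubleRowSketch(int capacity, double epsilon) {
        this.values = new double[capacity];
        this.epsilon = epsilon;
    }

    @Override
    public void inc(RowUpdate rowUpdate) {
        double[] deltas = ((DenseDoubleUpdateSketch) rowUpdate).deltas;
        for (int k = 0; k < values.length; k++) {
            values[k] += deltas[k];
            if (values[k] > -epsilon && values[k] < epsilon) values[k] = 0.0; // threshold
        }
    }

    @Override
    public ByteBuffer serialize() {
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES * values.length);
        for (double v : values) buf.putDouble(v);
        buf.flip();
        return buf;
    }
}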
Chapter 5 Performances
We emphasize that JBösen is, for the moment, about extending the Petuum project to Java users. It focuses on incorporating design principles from the existing Petuum project that exploit error tolerance and data parallelism, together with an easy-to-use programming interface for Java users. Currently, only a limited number of applications are implemented on top of the JBösen system, but more applications are coming in the near future. To demonstrate the performance potential of the JBösen system, we use the existing Matrix Factorization (MF) application implemented on top of JBösen.
Figure 5.1: Left: JBösen scaling performance using the MF application. Right: JBösen speedup using the MF application.
We demonstrate the scaling performance of the MF application implemented on top of JBösen. In Figure 5.1, using clusters of 1, 2, and 4 nodes running on a Netflix Ratings dataset, the application achieved a 3.8x speedup with 4 nodes and a 1.9x speedup with 2 nodes. This result shows that the system has the potential to scale well with an increasing number of machines. The testing cluster we used is a 4-node cluster, each machine with 16 Intel 2.1 GHz cores, 128 GB of RAM, and 10 Gbps Ethernet. We use the Netflix Ratings dataset, a matrix with 100,480,507 ratings. For cross-platform comparison, we did not find a fair benchmark for JBösen, although we compared its performance to an older version of the Petuum ML application on top of Bösen, since it incorporates similar design principles such as PS and SSP.
The ML application on top of JBösen performed around 5x faster and converged around 1.5x faster using the same Netflix Ratings dataset and the same 4-node cluster, but this is not a rigorous comparison, because the newest release of Bösen includes further improvements and performance optimizations. We could not use the newest release for the comparison because the most current Petuum MF application is implemented on top of another system, STRADS [2], under the Petuum project, not on Bösen. With more applications implemented on top of JBösen down the road, we will be able to evaluate the performance of the system in a more rigorous manner: we will have more benchmarks to compare against, whereas currently it is hard to find a fair comparison, and we will also be able to evaluate the performance of the system separately from the performance of the application implementations.
Chapter 6 Conclusion and Future Work
The key contribution of this thesis is a Java version of the bounded-asynchronous key-value store system for distributed ML algorithms on Big Data. It provides a simple programming model that covers a range of ML algorithms, focused on simple programming interfaces rather than raw performance; it exploits data parallelism for Big Data using the parameter server architecture; and it exploits the error-tolerance nature of ML algorithms. As more than 25% of the programmers in the world use Java, JBösen is a valuable contribution to the Petuum project, expanding the project to reach these Java programmers. Our work is just the beginning of the JBösen project; it has much potential to improve and expand in terms of both performance and functionality. The primary plan in the near future is to expand the library of ML applications implemented on JBösen. One of the visions of the Petuum project is to provide a complete software stack for Big ML, from the system to out-of-the-box ML applications, and JBösen, as part of the Petuum project, shares the same vision. There are already plans for a new Machine Learning Ranking application and a Latent Dirichlet Allocation application on JBösen, and more will come soon. JBösen will no doubt continue to improve in both performance optimization and usability. We have experimented with new optimizations through a proof of concept implemented on top of Bösen, since JBösen was at the time still being developed. Under the Petuum project, based on the idea of model parallelism, there is a model-parallel framework, STRADS [2], designed to ensure efficient and correct model-parallel execution of ML algorithms. While Bösen allows ML algorithms to be expressed as a series of iterative updates to a shared model, STRADS allows these parameter updates to be scheduled by discovering and leveraging structural properties of ML algorithms. In the near future, there are plans to integrate these two systems, as their design goals complement each other. In anticipation of that plan, we explored a variant of the SSP Push consistency model in which schedule information is used to improve the network performance of the system. Finally, the other aspect of JBösen we did not discuss in this thesis is fault tolerance with auto-recovery. Currently, JBösen does not offer any auto-recovery feature for node failures, which is considered essential for systems that scale beyond thousands of machines. The emphasis of JBösen, and by extension the Petuum project, is for the moment largely on allowing ML practitioners to implement and experiment with new ML algorithms on small to medium clusters. In fact, the infrastructure already exists for JBösen to build simple fault tolerance.
The clients are stateless, with the training data unchanged throughout the lifetime of the system, and the servers could use simple replication of the table data to increase the fault tolerance of the system. However, the performance implications of these features on both JBösen and Bösen are understudied. In the future, as the aims of JBösen and the Petuum project expand to cover larger clusters, fault tolerance will be the first feature to add to the system.
Appendices
Appendix A Complete Row Interface
Below is the complete row interface.
A.1 Row Update
JBösen is implemented against the following row update and row update factory interfaces:

/**
 * This interface specifies a row update for a Bösen PS table. An
 * implementation of this interface does not need to be thread-safe.
 */
public interface RowUpdate {
    /**
     * Increment this row update using a RowUpdate object.
     * @param rowUpdate the RowUpdate object to apply to this row update.
     */
    void inc(RowUpdate rowUpdate);

    /**
     * Serialize this row update into a ByteBuffer object.
     * @return a ByteBuffer object containing the serialized data.
     */
    ByteBuffer serialize();
}
/**
 * This interface specifies a factory for row updates. An implementation of
 * this interface must be thread-safe.
 */
public interface RowUpdateFactory {
    /**
     * Create and return a new row update.
     * @return a new row update constructed by this factory.
     */
    RowUpdate createRowUpdate();

    /**
     * Create a new row update from serialized data.
     * @param buffer a buffer containing serialized data for a row update.
     * @return a new row update constructed from the serialized data.
     */
    RowUpdate deserializeRowUpdate(ByteBuffer buffer);
}

A.2 Row Create Factory
JBösen is implemented against the following row and row create factory interfaces:

/**
 * This interface specifies a row in a JBösen PS table. An implementation of
 * this interface does not need to be thread-safe.
 */
public interface Row {
    /**
     * Increment this row using a RowUpdate object.
     * @param rowUpdate the RowUpdate object to apply to this row.
     */
    void inc(RowUpdate rowUpdate);

    /**
     * Serialize this row into a ByteBuffer object.
     * @return a ByteBuffer object containing the serialized data.
     */
    ByteBuffer serialize();
}
/**
 * This interface specifies a factory for rows. An implementation of this
 * interface must be thread-safe.
 */
public interface RowFactory {
    /**
     * Create and return a new row.
     * @return a new row constructed by this factory.
     */
    Row createRow();

    /**
     * Create a new row from serialized data.
     * @param buffer a buffer containing serialized data for a row.
     * @return a new row constructed from the serialized data.
     */
    Row deserializeRow(ByteBuffer buffer);
}
Appendix B Complete PsApplication Interface

/**
 * This class abstracts away some of the boilerplate code common to application
 * initialization and execution. In particular, this class takes care of
 * PsTableGroup initialization and shutdown, as well as worker thread
 * spawning, registering, and de-registering. An application using this class
 * should extend from it and implement initialize() and
 * runWorkerThread(int).
 */
public abstract class PsApplication {
    /**
     * This method is called after PsTableGroup is initialized and
     * before any worker threads are spawned. Initialization tasks such as
     * table creation and data loading should be done here. Initializing
     * PsTableGroup is not required.
     */
    public abstract void initialize();

    /**
     * This method executes the code for a single worker thread. The
     * appropriate number of threads are spawned and run this method.
     * Worker thread registering and de-registering are not required.
     * @param threadId the local thread ID to run.
     */
    public abstract void runWorkerThread(int threadId);
    /**
     * This method runs the application. It first initializes PsTableGroup,
     * then calls initialize(), and finally spawns
     * numLocalWorkerThreads threads, each of which runs a single
     * instance of runWorkerThread(int). This method returns after all
     * worker threads have completed and PsTableGroup has been shut
     * down.
     * @param config the config object.
     */
    public void run(PsConfig config) {
        PsTableGroup.init(config);
        initialize();
        int numLocalWorkerThreads = PsTableGroup.getNumLocalWorkerThreads();
        Thread[] threads = new Thread[numLocalWorkerThreads];
        for (int i = 0; i < numLocalWorkerThreads; i++) {
            threads[i] = new PsWorkerThread();
            threads[i].start();
        }
        for (int i = 0; i < numLocalWorkerThreads; i++) {
            try {
                threads[i].join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        PsTableGroup.shutdown();
    }
}
Appendix C Matrix Factorization Implemented Using JBösen APIs
Some code is omitted for clarity.

// Main worker thread.
public void run() {
    // Get tables by ID.
    DoubleTable LTable = PsTableGroup.getDoubleTable(LTableId);
    DoubleTable RTable = PsTableGroup.getDoubleTable(RTableId);
    ...
    int evalCounter = 0;
    for (int epoch = 1; epoch <= numEpochs; epoch++) {
        double learningRate = learningRateEta0
                * Math.pow(learningRateDecay, epoch - 1);
        for (int batch = 0; batch < numMiniBatchesPerEpoch; ++batch) {
            for (int ratingId = elemMiniBatchBegin;
                    ratingId < elemMiniBatchEnd; ratingId++) {
                Rating r = ratings.get(ratingId);
                MatrixFactCore.sgdOneRating(r, learningRate, LTable, RTable,
                        K, lambda);
            }
            PsTableGroup.clock(); // clock every minibatch.
        }
        evalCounter++;
    }
    PsTableGroup.globalBarrier();
    // Print all results.
    ...
}

// Perform a single SGD step on a rating and update LTable and RTable
// accordingly.
public static void sgdOneRating(Rating r, double learningRate,
        DoubleTable LTable, DoubleTable RTable, int K, double lambda) {
    int i = r.userId;
    int j = r.prodId;
    // Cache the rows for DenseRow bulk reads to avoid locking on each read
    // of an entry.
    DoubleRow LiCache = LTable.get(i);
    DoubleRow RjCache = RTable.get(j);
    double pred = 0; // <Li, Rj> dot product is the predicted value.
    for (int k = 0; k < K; ++k) {
        pred += LiCache.get(k) * RjCache.get(k);
        assert !Double.isNaN(pred) : "i: " + i + " j: " + j + " k: " + k;
    }
    // Now update L(i,:) and R(:,j) based on the loss function at X(i,j).
    // The loss function at X(i,j) is ( X(i,j) - L(i,:)*R(:,j) )^2.
    //
    // The gradient w.r.t. L(i,k) is -2*X(i,j)*R(k,j) +
    // 2*L(i,:)*R(:,j)*R(k,j).
    // The gradient w.r.t. R(k,j) is -2*X(i,j)*L(i,k) +
    // 2*L(i,:)*R(:,j)*L(i,k).
    DoubleRowUpdate LiUpdate = new DenseDoubleRowUpdate(K);
    DoubleRowUpdate RjUpdate = new DenseDoubleRowUpdate(K);
    double grad_coeff = -2 * (r.rating - pred);
    double nnzRowI = LiCache.get(K);
    double nnzColJ = RjCache.get(K);
    for (int k = 0; k < K; ++k) {
        // Compute the update for L(i,k).
        double LikGradient = grad_coeff * RjCache.get(k)
                + 2 * lambda / nnzRowI * LiCache.get(k);
        LiUpdate.set(k, -LikGradient * learningRate);
        // Compute the update for R(k,j).
        double RkjGradient = grad_coeff * LiCache.get(k)
                + 2 * lambda / nnzColJ * RjCache.get(k);
        RjUpdate.set(k, -RkjGradient * learningRate);
    }
    LTable.inc(i, LiUpdate);
    RTable.inc(j, RjUpdate);
}
Bibliography
[1] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing. High-performance distributed ML at scale through parameter server consistency models. In AAAI Conference on Artificial Intelligence.
[2] Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, and Eric P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Advances in Neural Information Processing Systems.
[3] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI.
[4] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB.
[5] Priit Potter. How many Java developers are there in the world? URL: plumbr.eu/blog/java/how-many-java-developers-in-the-world.
[6] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI.
[7] Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, and E. P. Xing. Consistent bounded-asynchronous parameter servers for distributed ML. Manuscript.
[8] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[9] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[10] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud 2010.
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Distributed R for Big Data
Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses
Stateful Distributed Dataflow Graphs: Imperative Big Data Programming for the Masses Peter Pietzuch [email protected] Large-Scale Distributed Systems Group Department of Computing, Imperial College London
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads
89 Fifth Avenue, 7th Floor New York, NY 10003 www.theedison.com @EdisonGroupInc 212.367.7400 IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads A Competitive Test and Evaluation Report
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
How to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director [email protected] Dave Smelker, Managing Principal [email protected]
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
The Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
1. Comments on reviews a. Need to avoid just summarizing web page asks you for:
1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
Using an In-Memory Data Grid for Near Real-Time Data Analysis
SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform Amir H. Payberah Swedish Institute of Computer Science [email protected] June 4, 2014 Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44 Big Data small data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
A survey of big data architectures for handling massive data
CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - [email protected] Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Big Data and Scripting Systems beyond Hadoop
Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Parallel Computing. Benson Muite. [email protected] http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite [email protected] http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
HadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
Bigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
CitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Understanding Neo4j Scalability
Understanding Neo4j Scalability David Montag January 2013 Understanding Neo4j Scalability Scalability means different things to different people. Common traits associated include: 1. Redundancy in the
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
General announcements In-Memory is available next month http://www.oracle.com/us/corporate/events/dbim/index.html X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
MapReduce and the New Software Stack
20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara
CS535 Big Data - Fall 2015 W1.B.1 CS535 Big Data - Fall 2015 W1.B.2 CS535 BIG DATA FAQs Wait list Term project topics PART 0. INTRODUCTION 2. A PARADIGM FOR BIG DATA Sangmi Lee Pallickara Computer Science,
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
Optimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day
STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Big Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
Scalability and Design Patterns
Scalability and Design Patterns 1. Introduction 2. Dimensional Data Computation Pattern 3. Data Ingestion Pattern 4. DataTorrent Knobs for Scalability 5. Scalability Checklist 6. Conclusion 2012 2015 DataTorrent
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Networking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Parquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li [email protected] Software engineer, Cloudera Impala Outline Context from various
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
