SQLMR: A Scalable Database Management System for Cloud Computing

Meng-Ju Hsieh, Institute of Information Science, Academia Sinica, Taipei, Taiwan, [email protected]
Chao-Rui Chang, Institute of Information Science, Academia Sinica; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, [email protected]
Li-Yung Ho, Institute of Information Science, Academia Sinica; Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, [email protected]
Jan-Jan Wu, Institute of Information Science, Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan, [email protected]
Pangfeng Liu, Department of Computer Science and Information Engineering, Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan, [email protected]

Abstract

As the size of data sets in the cloud increases rapidly, how to process large amounts of data efficiently has become a critical issue. MapReduce provides a framework for large-scale data processing and has been shown to be scalable and fault-tolerant on commodity machines. However, it has a steeper learning curve than SQL-like languages, and MapReduce code is hard to maintain and reuse. On the other hand, traditional SQL-based data processing is familiar to users but limited in scalability. In this paper, we propose a hybrid approach to fill the gap between SQL-based and MapReduce data processing. We develop a data management system for the cloud, named SQLMR. SQLMR compiles SQL-like queries into a sequence of MapReduce jobs. Existing SQL-based applications work seamlessly with SQLMR, and users can manage terabyte- to petabyte-scale data with SQL-like queries instead of writing MapReduce code. We also devise a number of optimization techniques to improve the performance of SQLMR.
The experiment results demonstrate both the performance and scalability advantages of SQLMR compared to MySQL and two NoSQL data processing systems, Hive and HadoopDB.

Index Terms: cloud data management, NoSQL framework, SQL to NoSQL translation and optimization, MapReduce

I. INTRODUCTION

The advent of cloud computing and hosted software as a service is creating a novel market for data management. Cloud-based DB services are starting to appear, and have the potential to attract customers from very diverse sectors of the market, from small businesses aiming at reducing the total cost of ownership to very large enterprises seeking high-profile solutions spanning potentially thousands of machines. At the same time, due to the ever-increasing size of data sets, traditional parallel database solutions can be prohibitively expensive. To perform this type of analysis in a cost-effective manner, several companies have developed distributed data storage and processing systems on large clusters of shared-nothing commodity servers, including Google File System [1], BigTable [2], MapReduce [3], Hadoop [4], Amazon's Simple Storage Service (S3) [5], SimpleDB [6], and Microsoft's SDS cloud database [7]. There are also various NoSQL databases used to manage large amounts of data, including MongoDB [8], Apache CouchDB [9], Cassandra [10], and Dynamo [11]. Many of these cloud databases are designed to run on clusters of hundreds to thousands of nodes and are capable of serving data ranging from hundreds of terabytes to petabytes. Compared with traditional relational database servers, such cloud databases may offer less querying capability and often weaker consistency guarantees, but they scale much better by providing built-in support for availability, elasticity, and load balancing.
On the other hand, data management tools are an important part of the relational and analytical data management business, since business analysts are often not technically advanced and do not feel comfortable interfacing with low-level database software directly. These tools typically interface with the database using ODBC or JDBC, so database software that wants to work with these products must accept SQL queries. Therefore, a novel technology that combines DBMS capability with cloud-scale scalability is highly desirable. In this paper, we propose a hybrid solution, called SQLMR,
that combines the programming advantages of SQL with the fault tolerance, heterogeneous-cluster support, and scalability of MapReduce. Users of SQLMR can write data management programs in a familiar query language or run existing programs without modification. SQLMR provides a compiler that translates a SQL program into a MapReduce program and executes it in a MapReduce system. To achieve high performance in data processing, we also devise a number of optimization techniques. The major contributions of this work are summarized as follows.
- A SQL-to-MapReduce compiler and runtime framework, called SQLMR. Currently, SQLMR supports a subset of SQL queries that, to our knowledge, is sufficient to support various large-scale analytical data management applications, such as on-line analytical processing (OLAP) and data mining.
- A low-overhead data file construction technique that enables fast dynamic conversion of SQL database files to HDFS (Hadoop distributed file system) files that can be accepted as input by the MapReduce runtime engine. This technique significantly reduces data conversion time between SQL and MapReduce.
- Effective database partitioning and indexing techniques for fast locating of queried data in HDFS and for reducing disk I/O for range queries.
- A query result caching mechanism that avoids re-processing of redundant queries.
- Optimization techniques for Hadoop's MapReduce runtime system to further reduce query processing time.
We conduct extensive experiments to evaluate the effectiveness of SQLMR. The comparison with Hive and HadoopDB, two well-known MapReduce-based NoSQL database management systems, as well as with MySQL and MySQL cluster, demonstrates the performance and scalability advantages of SQLMR. The rest of the paper is organized as follows. Section II describes related data management systems. Section III presents the system architecture of SQLMR and the interaction between the components of the system.
Section IV presents the optimization techniques we devise for SQLMR. Section V reports our experiment results, and Section VI gives some concluding remarks.

II. RELATED WORK

Current commercial cloud DB products, such as Amazon's Simple Storage Service (S3) [5], SimpleDB [6] and Relational Database Service (RDS) [12], and Microsoft's SQL Server [13] and SDS [7] cloud database, all claim to support SQL. However, most of them do not satisfy many of the desirable properties of a cloud DBMS. For example, Amazon's SimpleDB only supports a small subset of SQL queries; it does not support complex queries such as aggregations and joins. Microsoft's SDS, although it supports full SQL functionality, is far inferior to traditional SQL servers in both performance and scalability.
Bigtable [2] is a shared-nothing architecture and is column-oriented, which is optimized for reads. It is designed for high availability and high performance for massive read and write operations on clusters of hundreds to thousands of machines. It provides a low-level API for data manipulation. It also provides a self-managing policy to achieve load balancing and dynamic addition/removal of servers. However, Bigtable does not support SQL queries; it only supports queries by row key [14]. This makes it harder for users to deploy existing applications that work on traditional databases. Moreover, since general-purpose applications may concurrently access different columns of the same row, supporting only single-row transactions may degrade performance. A finer-grained lock, rather than a single row lock, may be needed to allow concurrent access to different columns of a row. Since Bigtable is built on the Google File System, it provides good scalability and fault tolerance. However, it lacks support for interfacing with existing data management tools.
S3 [5] provides a simple web service interface to store and retrieve data. It gives developers highly scalable, reliable, and fast data access.
Unlike S3, SimpleDB does not store raw data; it takes the data as input and expands it to create indices across multiple dimensions, which enables fast querying of data. In addition, S3 uses dense storage devices to store large objects, while SimpleDB uses less dense devices to store small bits of data. The structured data of SimpleDB is organized in domains, like a spreadsheet, in which users can insert data, retrieve data, and run queries. Each domain consists of items that are described by attribute name-value pairs. An attribute can have multiple values. SimpleDB keeps multiple copies of each domain. When the data is updated, all copies of the data are updated. However, it takes time for the update to propagate to all storage locations. The data will eventually be consistent, but an immediate read might not show the change. SimpleDB may be suitable for applications like shopping carts. For applications that need strict consistency, SimpleDB is not a feasible solution. Moreover, SimpleDB does not support aggregate operations like join, group, and sorting. Developers have to implement these operations by themselves.
Amazon RDS [12] is a web service to set up, operate, and scale a relational database in the cloud. It provides the full capabilities of a MySQL database (5.1). This means that applications which work with existing MySQL databases can work seamlessly with Amazon RDS.
Traditional DBMSs like MySQL [15] have been widely used in many domains. However, to satisfy the ACID properties, they use locking mechanisms to guarantee correct concurrent data processing, which limits their scalability. Cloud DBMSs have more relaxed ACID constraints; for example, many NoSQL databases only guarantee eventual consistency, resulting in greater scalability. MySQL cluster [16] is a shared-nothing, distributed real-time database. It uses synchronous replication through a two-phase commit mechanism and guarantees data availability. It claims fast data access by storing the indexed columns
in main memory, and the ability to service tens of thousands of transactions per second. MySQL cluster provides support for interfacing with current data management tools. However, its scalability in both data size and machine count is not acceptable for cloud computing.
Declarative languages like SQL are often unnatural and restrictive to programmers; on the other hand, code written in a procedural framework like MapReduce is hard to maintain and reuse. Several efforts have been devoted to combining DBMS capability with the scalability of MapReduce, including Pig, Hive, and HadoopDB, as described in the following. Yahoo! developed a new language, Pig Latin [17], to fit a sweet spot between these two sides. They also developed the Pig compiler to compile Pig Latin programs into a plan of MapReduce jobs. However, legacy SQL code is not compliant with Pig Latin, which limits the portability of existing applications to Pig. Similar to Pig, Hive [18] also provides a SQL-like query language, named HiveQL, which makes it easy to remain compatible with existing applications that use SQL queries. Hive also compiles HiveQL to map-reduce code. Compared with Pig, Hive produces more efficient map-reduce code. The HadoopDB system [19] is a data management system that combines DBMS capability with MapReduce techniques. It targets analytical workloads on structured data and is designed to run on commodity machines. HadoopDB inherits the scalability of Hadoop, and the authors claim it achieves superior performance compared with current parallel DBMSs.
The major differences between these hybrid data management systems (SQLMR, Pig, Hive, and HadoopDB) can be summarized as follows. We devise a number of novel optimization techniques to improve the performance of query processing in SQLMR: (1) a low-overhead data file construction technique that enables fast dynamic conversion of SQL database files to HDFS (Hadoop distributed file system) files that can be accepted as input by the MapReduce runtime engine.
This technique significantly reduces data conversion time between SQL and MapReduce; (2) a set of effective database partitioning and indexing techniques for fast locating of queried data in HDFS and reducing disk I/O for range queries; (3) a query result caching mechanism that avoids re-processing of redundant queries; and (4) optimization techniques for Hadoop's MapReduce runtime system to further reduce query processing time. Our experiment results in a later section demonstrate the performance and scalability advantages of SQLMR against these three systems.

III. SYSTEM ARCHITECTURE

In this section, we describe the SQLMR system architecture. Since most users are familiar with SQL-like languages, the goal of the SQLMR system is to provide a framework that combines the programming advantages of SQL with the scalability and fault tolerance of MapReduce. Figure 1 illustrates the concept of SQLMR. The system accepts SQL queries as input and translates them into a sequence of MapReduce jobs. When the MapReduce jobs are completed, the system returns the query results to the user in SQL form.
Figure 1 also depicts the system architecture of SQLMR. There are four main components in SQLMR: the SQL-to-MapReduce Compiler, the Query Result Manager, the Database Partitioning and Indexing Manager, and the Optimized Hadoop system. The database partitioning/indexing manager maintains the table schema, indexed files, and metadata. The other three components interact with the database partitioning/indexing manager to acquire the necessary information when processing a query. For simplicity, we did not draw arrows between the database partitioning/indexing manager and the other three components in Figure 1. We describe each component in the following paragraphs.

Fig. 1. System architecture of SQLMR.

a) SQL-to-MapReduce Compiler: The compiler takes SQL queries as input and translates them into a sequence of map-reduce jobs.
In the following example, let student_hw be a database table containing 3 columns (id, hw, score). The following SQL query counts the number of students whose hw 1 score is higher than 80:

SELECT COUNT(s.id) FROM student_hw AS s
WHERE s.score > 80 AND s.hw = 1

The SQL query is translated into a map-reduce job pair. First, the map phase reads records from the student_hw table and produces an output record with two parts. The first part is called the key, which is populated with the student id (s.id) of each student whose hw 1 score is higher than 80. The second is the value, which simply contains 1. The reduce phase then reads the key-value pairs and adds all the 1's to obtain the total count. The query result returned by SQLMR is a value that represents the total count. Note that the number of mappers and reducers used to execute a query is decided by the underlying Hadoop
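The map and reduce phases described above can be sketched as follows. This is a minimal illustration in Python (function and variable names are ours; the actual code SQLMR generates targets the Hadoop runtime), showing how the filter lands in the map phase and the summation in the reduce phase.

```python
# Hedged sketch of the map/reduce pair for the COUNT query above.

def map_phase(records):
    """Emit (id, 1) for every record matching the WHERE clause."""
    for rec in records:  # rec is a dict with keys: id, hw, score
        if rec["score"] > 80 and rec["hw"] == 1:
            yield (rec["id"], 1)

def reduce_phase(pairs):
    """Add up all the 1's to obtain the total count."""
    return sum(value for _key, value in pairs)

rows = [
    {"id": 1, "hw": 1, "score": 95},
    {"id": 2, "hw": 1, "score": 70},  # score too low, filtered out
    {"id": 3, "hw": 2, "score": 90},  # wrong hw, filtered out
    {"id": 4, "hw": 1, "score": 81},
]
print(reduce_phase(map_phase(rows)))  # -> 2
```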
runtime system. In the future, we may provide functionality for users to decide the number of mappers and reducers for processing a query.
Table I outlines the set of query operations currently supported in SQLMR. Note that these query operations are read-only. This is because, unlike transactional data processing, analytical data processing (such as OLAP) usually reads and processes large input data sets without modifying them. Currently, SQLMR connects with multiple SQL servers to form a hybrid data management system, in which smaller-size queries and write operations are processed by the SQL servers, while large-size read operations are processed by SQLMR. The management of transactional data processing on the multiple SQL servers and the interaction between SQLMR and the SQL servers in the hybrid system is out of the scope of this paper due to the page limit.

TABLE I
THE QUERY OPERATIONS CURRENTLY SUPPORTED IN SQLMR

Type                  | Supported Functions
Basic Operations      | SELECT, WHERE, ATTRIBUTES (Single, Multiple, *), DISTINCT
Computing Operations  | SUM, COUNT, GROUP BY, ORDER BY (DESC, ASC)
Condition Operations  | BETWEEN-AND, MULTI-CONDITION
Data Operation        | JOIN, JOIN MULTI-TABLE, SUB-QUERY

b) Query Result Manager: The query result runtime system caches the result of each query. When a new query enters SQLMR, the compiler first passes the query to the Query Result Manager to compare the query with previous ones in the log. If there is a valid cached result for that query, the result is returned to the user without re-processing the query. Otherwise, the compiler parses the query and generates optimized MapReduce code. Cached results are invalidated when a user updates or deletes data in the database.

c) Database Partitioning and Indexing Manager (DPIM): This system component manages data files and indexing. When new data is added to the system, DPIM partitions the new data and creates an index for it.
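The caching behavior of the Query Result Manager can be sketched as follows. This is a hedged illustration (class and method names are ours, not SQLMR's API): a repeated query is served from the cache, and an update or delete invalidates the cached results.

```python
# Minimal sketch of query-result caching with invalidation on writes.

class QueryResultManager:
    def __init__(self, execute):
        self._execute = execute      # callback that actually runs the MapReduce jobs
        self._cache = {}             # query text -> cached result

    def query(self, sql):
        if sql in self._cache:       # valid cached result: skip re-processing
            return self._cache[sql]
        result = self._execute(sql)  # compile and run as MapReduce jobs
        self._cache[sql] = result
        return result

    def on_update(self):
        # A user updated or deleted data: cached results are no longer valid.
        self._cache.clear()

calls = []
mgr = QueryResultManager(lambda sql: calls.append(sql) or 42)
mgr.query("SELECT COUNT(*) FROM t")   # executes the jobs
mgr.query("SELECT COUNT(*) FROM t")   # served from the cache
print(len(calls))                      # -> 1
mgr.on_update()
mgr.query("SELECT COUNT(*) FROM t")   # re-executed after invalidation
print(len(calls))                      # -> 2
```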
With smart partitioning and indexing, SQLMR can quickly locate queried data blocks and identify the exact data blocks that need to be accessed in a range query, in order to reduce disk I/O. The partitioning and indexing techniques are what distinguish our work from other related efforts, which typically export the entire data file to the MapReduce runtime system.

d) Optimized Hadoop: Hadoop is a software framework for distributed processing of large data sets on compute clusters. Our compiler generates optimized map-reduce jobs and executes them on the Hadoop system. We devise a set of optimizations, such as cross-rack communication optimization, to improve the performance of the Hadoop system.

IV. PERFORMANCE OPTIMIZATION

A. Data Partitioning and Pre-processing

In this section, we describe SQLMR's approach to transferring data from a traditional RDBMS to the Hadoop MapReduce system. We also give an overview of HadoopDB's approach, which will be used as a basis for comparison in our experiments. Figure 2 is the flowchart of HadoopDB. At the beginning, the user needs to export all data from PostgreSQL to a comma-separated value (CSV) text file and put the file in HDFS for subsequent hash-partitioning pre-processing. The purpose of hash-partitioning is to push more query logic into the databases (e.g., joins). This is done in two phases. First, the data file is loaded into HDFS. Then, a custom HadoopDB Hadoop job, named GlobalHasher, re-partitions the data into a specified number of partitions (e.g., the number of nodes in a cluster). The next step is to download all partitioned data from HDFS to local disk and import the split data into the local PostgreSQL server on each node. In contrast, in our SQLMR framework, all database files are stored in HDFS directly, without any pre-processing. This design reduces the pre-processing time significantly.
Although the hybrid design of HadoopDB allows users to access the PostgreSQL database directly, there are two problems that need to be overcome. First, in HadoopDB each PostgreSQL server only stores a partition of the database; the user therefore has to merge all partial query results returned from each database partition to get the final result. Second, users must recover the system manually should a PostgreSQL server crash. In comparison, our SQLMR framework stores all data in HDFS, which saves the user the burden of gathering partial query results. Recovery is also done automatically by the Hadoop MapReduce system. Furthermore, storing data in HDFS inherits the fault tolerance provided by HDFS: HDFS replicates the data remotely to ensure data availability if machines crash. SQLMR relies on HDFS for data replication, and the replication level of the data is also decided by HDFS.
To further reduce data loading time and speed up data processing, we develop a number of optimizations in the SQLMR framework, as shown in Figure 3. At the beginning, SQLMR analyzes the table schema to get the size of one record. Next, SQLMR reads all data from the database server and partitions the data according to the analyzed schema and the block size of HDFS. Finally, the hashed and partitioned table data is stored in HDFS. The details of partitioning are described in the next subsection.
HadoopDB implements a hybrid database system by developing a Database Connector to retrieve data from a traditional PostgreSQL database. The Database Connector in HadoopDB is similar to a normal database client implemented in JDBC (Java Database Connectivity). During the HadoopDB experiment, we found that HadoopDB takes a lot of time to retrieve
data from PostgreSQL and causes very heavy I/O load. To overcome this issue, we develop a low-overhead data file construction technique that enables fast dynamic conversion of SQL database files to a format that can be accepted as input by the MapReduce runtime engine. The file construction technique implements the InputFormat interface of the Hadoop MapReduce system. The interface is called by the Mapper function to read the needed data from HDFS, and it can be designed to read data in any format. In SQLMR, we implement a custom InputFormat Java class, labeled as Data Connector in Figure 4, which can read MySQL database files directly without having to export the database as text files. This technique significantly reduces data conversion time between SQL and MapReduce. The data connector assigns one mapper per SQL DB, because assigning too many mappers to one host would result in too heavy disk I/O and decrease throughput. In addition, the data connector balances the workload on the mappers using the following strategy. First, each mapper connects to every SQL DB and issues SELECT COUNT to get the total number of needed records stored in each server. Next, each mapper randomly selects the SQL DB it will retrieve data from. Finally, each mapper retrieves approximately the same amount of partial data from its selected SQL DB using the SELECT LIMIT command.

Fig. 2. The SQL to HDFS data loading process in HadoopDB.

Fig. 3. The SQL to HDFS data loading process in SQLMR.

Fig. 4. The Data Connector that enables reading data from database files.

B. Data Indexing

An index is a data structure that facilitates and improves the performance of data retrieval and searching in a traditional DBMS. For a cloud DBMS, it becomes even more critical, since the stored data is enormous and we need to identify the data of interest quickly. In SQLMR, we employ two indexing techniques to accelerate data searching. SQLMR chooses a suitable index technique depending on the characteristics of the database. We next introduce the two index techniques.
1) Partition Index: In this approach, the files storing database tables are split into fixed-length files. The size can be determined by the HDFS block size, such that a file is completely contained in one block. Each file contains the records for a range of serial keys. The range of keys is decided by the table schema and the size of a file. For example, in Figure 5, there are 4 columns in a table. After table schema analysis, a record is 2KB and a file is 64MB, so each file holds 64MB / 2KB = 32,768 records. SQLMR will thus partition one table data file into multiple partitioned table data files. One file will contain a series of data rows in which the keys range from 1 to 32,768; the next file will contain data rows with keys ranging from 32,769 to 65,536, and so on. The insertion, deletion, and search operations for a key all take constant time O(1), because the key determines which file contains the corresponding record, and the offset within the file can be calculated from the key of the target record. This approach is suitable for data with a dense key space, since we pre-allocate the space for each record in a file. For general cases, we use a B+ tree index.

Fig. 5. Illustration of data partitioning in SQLMR.
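The constant-time lookup of the partition index follows from simple arithmetic, which can be sketched as follows (constants match the 2KB-record / 64MB-file example above; function name is ours).

```python
# Partition-index lookup sketch: with fixed-size records packed into
# HDFS-block-sized files, the file and byte offset of any key are computed
# directly, giving O(1) insert, delete, and search.

RECORD_SIZE = 2 * 1024                        # 2KB per record (from schema analysis)
FILE_SIZE = 64 * 1024 * 1024                  # 64MB per file (one HDFS block)
RECORDS_PER_FILE = FILE_SIZE // RECORD_SIZE   # 32768 keys per file

def locate(key):
    """Map a serial key (1-based) to (file index, byte offset in that file)."""
    file_index = (key - 1) // RECORDS_PER_FILE
    offset = ((key - 1) % RECORDS_PER_FILE) * RECORD_SIZE
    return file_index, offset

print(locate(1))      # -> (0, 0): first key of the first file
print(locate(32769))  # -> (1, 0): first key of the second file
```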
2) B+ tree Index: B+ tree indexes are extensively and commonly used in database indexing. Many open-source and commercial database products, like Oracle [20] and MySQL, apply B+ trees to index data. It is a general approach applicable to various applications. In SQLMR, the B+ tree structure is maintained by the HDFS master, and we modify the DFSClient module in Hadoop so that we can query the block location by a key through the tree. Search, deletion, and insertion all take logarithmic amortized time O(log N), where N is the number of nodes in the tree. In order to build the index tree, we query the master node for the block information. The master node returns all the locations of a block, including the master copy and its replicas. Our B+ tree nodes store the information of all replicated blocks to maintain high availability of block information. The query process is as follows. SQLMR receives a query with a search key and finds the corresponding block through the tree. The internal nodes of the tree only store key information for searching, while the leaf nodes store block information, including the block ID and location. Figure 6 illustrates the structure of the B+ tree.
To facilitate block merging and splitting, the keys in a block have to be sorted. To reduce the extra overhead of sorting, we sort the keys in a block only when blocks are merged or split. Since the number of keys in a block is bounded by a constant C, where C = (size of a block) / (size of a record), the extra time complexity of sorting is O(1).

Fig. 6. The structure of the B+ tree.

C. Hadoop Optimization

User queries are compiled into MapReduce jobs and run on Hadoop; therefore, the performance of the Hadoop framework is critical to SQLMR performance. We employ our previous work on cross-rack optimization in the Hadoop framework to improve the performance of Hadoop. We give a brief introduction to this optimization technique in the following.
MapReduce employs an all-to-all communication model between mappers and reducers. This saturates the network bandwidth of the top-of-rack switches in the shuffle phase, which makes some reducers straggle and increases job execution time. In the current Hadoop implementation, the placement of reducers is random, which may result in network load imbalance and make the reducers on a busy rack become stragglers. In our previous work, we model the traffic in the shuffle phase and give two optimal algorithms to balance the network load among racks by placing the reducers on racks properly. Our experiments show that the improvement reaches 32% for the PageRank application.
In [21], we propose the Reducer Placement Problem (RPP), which is defined as follows. Given the number of racks, the number of mappers on each rack, and the number of reducers to schedule, how do we determine the number of reducers to run on each rack? We give a simplified traffic model and derive an objective function to represent the amount of traffic of a rack. In this model, the traffic of rack i is a function of the number of reducers r_i running on it, and we formulate RPP as a min-max optimization problem. Formally, suppose we have N racks, M mappers, and R reducers, and the number of mappers on each rack, {m_1, m_2, ..., m_N}, is known. The traffic of rack i is f_i(r_i); we want to find the number of reducers on each rack, {r_1, r_2, ..., r_N}, that achieves

    min over {r_1, ..., r_N : r_1 + ... + r_N = R} of max{f_1(r_1), ..., f_N(r_N)}

One of our optimal algorithms works in a greedy manner. It places one reducer on a rack at a time. The main idea is to always place the next reducer on the rack with the minimum traffic so far. Algorithm 1 shows the pseudocode of our greedy algorithm. Note that we use an array state_tuple to store the number of reducers on each rack.
For example, if we have four racks and ten reducers to schedule, our greedy algorithm returns state_tuple = [1, 2, 3, 4], which means placing one reducer in rack 1, two in rack 2, three in rack 3, and four in rack 4.
We choose five popular and representative applications as our benchmark suite and conduct the experiment on a four-rack cluster. The evaluation metric is the speedup rate compared with un-optimized Hadoop. Table II shows the speedup of each application. The best improvement is for PageRank, which achieves 32%, while FPM does not improve significantly. This is because FPM has a very skewed distribution of intermediate data sizes, and our traffic model assumes the size of the intermediate data is constant. In contrast, the other applications have approximately the same intermediate data size.

TABLE II
SPEEDUP OF BENCHMARKS

Benchmark | Speedup (%)
Grep      | 9.35
WordCount |
PageRank  | 32
K-mean    | 14.7
FPM       | 1.76
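The greedy placement described above can be written in runnable form as follows (function and variable names are ours; the traffic expression follows Algorithm 1's simplified model, with the update of the running minimum made explicit).

```python
# Greedy heuristic sketch for the Reducer Placement Problem (RPP):
# repeatedly place one reducer on the rack whose resulting traffic,
# under the simplified per-rack traffic model, would be smallest.

def place_reducers(mappers, R):
    """mappers[j] = number of mappers on rack j; R = total reducers to place."""
    N = len(mappers)
    M = sum(mappers)
    state_tuple = [0] * N                      # reducers assigned to each rack
    for _ in range(R):
        best_rack, best_traffic = 0, float("inf")
        for j in range(N):
            # Traffic of rack j if it received one more reducer.
            traffic = (M - 2 * mappers[j]) * (state_tuple[j] + 1) + mappers[j] * R
            if traffic < best_traffic:
                best_traffic, best_rack = traffic, j
        state_tuple[best_rack] += 1
    return state_tuple

placement = place_reducers([4, 4, 4, 4], 10)   # four racks, ten reducers
print(sum(placement))                           # -> 10 (all reducers placed)
```

With equal mapper counts per rack the model is symmetric, so the reducers spread almost evenly; the uneven [1, 2, 3, 4] example in the text corresponds to racks with unequal mapper counts.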
Algorithm 1 Greedy Algorithm for RPP
Require: The number of mappers on each rack: {m_1, m_2, ..., m_N}
Ensure: A reducer state tuple: {r_1, r_2, ..., r_N}
  N <- number of racks
  M <- number of total mappers
  R <- number of total reducers
  state_tuple[N] <- {0, 0, ..., 0}
  for i = 1 to R do
    minimal <- infinity
    for j = 1 to N do
      traffic <- (M - 2*m_j) * (state_tuple[j] + 1) + m_j * R
      if traffic < minimal then
        minimal <- traffic
        candidate <- j
      end if
    end for
    state_tuple[candidate] <- state_tuple[candidate] + 1
  end for
  return state_tuple

V. EXPERIMENT

A. Experiment Setting

In the experiment, we use SysBench as the database benchmark and compare SQLMR with other database systems: standalone MySQL on Ceph, in which the data files are stored on the Ceph distributed file system; MySQL cluster; and two MapReduce-based systems, Hive and HadoopDB. SysBench is a modular, cross-platform, and multi-threaded benchmark tool for evaluating OS parameters that are important for a system running a database application under intensive load. We use the OLTP module of SysBench to benchmark real database performance. OLTP can generate a large amount of sequential data indexed by the column id. It can also generate transactional queries. Since Hive and HadoopDB only handle read operations with the MapReduce framework, in our experiments we only compare the performance of read operations, including range sum and join queries. We use the Ceph DFS as the underlying distributed file system for MySQL in order to allow MySQL to accommodate large datasets. MySQL cluster loads all data into memory to achieve fast response times; however, the total size of memory limits its scalability. Hive is a data warehouse infrastructure built on top of Hadoop. Hive defines a simple SQL-like query language that enables users familiar with SQL to query the data without writing MapReduce code. HadoopDB is an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. The experiment contains two parts: data scalability and system scalability.
The former shows scalability with respect to increasing data size while the number of nodes is fixed at 10; the latter shows scalability with respect to increasing system size, with the data size fixed at 10GB per node and up to 64 nodes in total. Each node contains 2 CPU cores rated at 2.27GHz, 4GB of memory, and 200GB of disk space, and all nodes are connected by one Gigabit Ethernet switch. The execution time of each experiment is measured with the time command, and each data point is the average of 10 runs.

B. Experiment Results

1) Effect of Data Size: This set of experiments compares scalability with respect to increasing data size. The number of nodes is fixed at 10, and the data size varies from 512MB to 1TB. Figure 7 shows the execution time of MySQL, MySQL cluster, Hive, HadoopDB, and SQLMR on the SELECT operation with different data sizes. Figure 8 and Figure 9 give graphical illustrations of the performance comparison. The SQL query is as follows.

SELECT sum(id) FROM table
WHERE id >= max(id)/2 AND id <= max(id)

For small data sizes (Figure 8), we found that MySQL and MySQL cluster outperform the MapReduce-based systems when the data size is smaller than 4GB. When the data size grows beyond 8GB, both are outperformed by the MapReduce-based systems. The reason is that MySQL does not parallelize the processing of a single query, while MySQL cluster employs an in-memory database technique, which writes all data to memory before starting database operations and is thus limited by the size of physical memory. In our experiment environment, MySQL cluster crashes due to running out of memory when the data size reaches 64GB. In Figure 9, MySQL crashes when the data size reaches 772GB. The reason is that a MySQL data table can only accommodate 2^32 data records; 772GB is the maximum data size that can be generated by the SysBench benchmark because of this constraint. For the MapReduce-based systems, we found that the execution time of HadoopDB increases dramatically with the increase in data size.
The reason is that HadoopDB incurs a higher I/O workload caused by the multi-phase data pre-processing described in Section IV-A. SQLMR consistently outperforms HadoopDB because of the performance improvement from the various optimizations described in Section IV. SQLMR is 2.82 times faster than HadoopDB with 32GB data size and times faster than HadoopDB with 1TB data size. Furthermore, SQLMR is 1.41 times faster than Hive on average. Figure 10 shows the execution time of Hive, HadoopDB, and SQLMR on the JOIN operation with different data sizes. Figure 11 and Figure 12 give graphical illustrations of the performance comparison. The SQL query is as follows.

  SELECT sum(table1.id) FROM table1 JOIN table2 ON (table1.id = table2.id)
  WHERE table2.id >= max(table2.id)/2 and table1.id <= max(table1.id)

For small data sizes (Figure 11), we found that the execution time of HadoopDB increases dramatically with the increase in data size. The reason is that HadoopDB incurs a higher I/O workload caused by the multi-phase data pre-processing described in Section IV-A. SQLMR is 1.47 times faster than HadoopDB with 1GB data size and 3.55 times faster than
Fig. 7. Comparison of execution time between different database systems on SELECT query with different data sizes.

Fig. 8. Comparison of execution time of SELECT query between different database systems with small data sizes.

Fig. 9. Comparison of execution time of SELECT query between different database systems with large data sizes.

Fig. 10. Comparison of execution time between different database systems on JOIN query with different data sizes.

Fig. 11. Comparison of execution time between different database systems on JOIN query with small data sizes.

Fig. 12. Comparison of execution time between different database systems on JOIN query with large data sizes.

HadoopDB with 16GB data size. Furthermore, SQLMR is 1.18 times faster than Hive on average. For large data sizes (Figure 12), the speed-up factors of SQLMR against HadoopDB and Hive are even more significant: SQLMR is 8.23 times faster than HadoopDB with 32GB data size and times faster with 1TB data size, and 1.29 times faster than Hive on average.

2) Effect of System Size: This set of experiments compares the scalability of the different database systems with respect to increasing system size. The number of physical nodes varies from 1 to 16, and each physical node runs four virtual machines (i.e., virtual nodes). The data size per virtual node is fixed at 10GB. Figure 13 compares the results of the SELECT (range-sum) query, and Figure 14 shows the results of the JOIN query. As shown in Figure 13, HadoopDB exhibits unstable system scalability, while SQLMR and Hive behave similarly and both
exhibit more stable scalability than HadoopDB. All of the systems perform worst when the number of virtual nodes (i.e., virtual machines) is four. This is because the four virtual machines residing on the same physical node saturate resource utilization on that node. When the number of virtual machines increases from 4 to 16, the workload is shared among multiple physical nodes, which increases parallelism and decreases execution time. When the number of virtual machines increases to 32, the network becomes the bottleneck and the execution time increases again. From Figure 13, we can see that the performance improvement ratio of SQLMR against HadoopDB ranges from 4.16 (1 node) to 4.95 (64 nodes). The performance improvement ratio of SQLMR against Hive ranges from 1.67 to .

Fig. 13. Comparison of execution time between different database systems on SELECT query, with fixed data size per node and varied number of nodes.

Fig. 14. Comparison of system scalability between different database systems on JOIN query, with fixed data size per node and varied number of nodes.

Figure 14 also shows that HadoopDB exhibits unstable system scalability, while SQLMR and Hive behave similarly and both exhibit more stable scalability than HadoopDB. The main reason for HadoopDB's poor system scalability is that HadoopDB uses a PostgreSQL server on each local node as the database storage, without the support of a distributed storage system; HadoopDB uses the Map function to collect all the queried data and sends the whole data set to the reducer for computation of the final result. From Figure 14, we can see that the performance improvement ratio of SQLMR against HadoopDB ranges from 6.03 (2 nodes) to (64 nodes). The performance improvement ratio of SQLMR against Hive ranges from 1.65 to .

3) Comparison of Data Preprocessing: Figure 15 compares the breakdown of data pre-processing overhead between SQLMR and HadoopDB.
Recall that HadoopDB requires four phases of data pre-processing (Section IV-A), while SQLMR requires only two. For the partition phase, SQLMR uses the split utility of the Linux OS to partition database table data into small files. As shown in Figure 16, HadoopDB requires times more partitioning time than SQLMR with 512MB data size. This is because HadoopDB relies on the Map function of the MapReduce system for the data partitioning phase; this approach benefits from the large parallelism provided by the MapReduce system, but it suffers large overhead when partitioning small data files. On the other hand, the performance advantage of SQLMR decreases as the data size increases: at 1TB data size, HadoopDB requires 70% of SQLMR's partitioning time (Figure 17). The reason is that split is not parallelized, so the partitioning time of SQLMR grows with the increase in data size. We expect the partitioning phase of SQLMR to improve once a parallel partitioning function is available. In terms of total pre-processing time, SQLMR is times faster than HadoopDB with 512MB data size and 2.68 times faster than HadoopDB with 1TB data size. The average speed-up factor of SQLMR against HadoopDB over the various data sizes is .

Fig. 15. Comparison of Data Preprocessing time between HadoopDB and SQLMR.

VI. CONCLUSION

In this paper, we have proposed a hybrid solution, called SQLMR, that combines the programming advantage of SQL
with the fault-tolerant, heterogeneous-cluster, and scalable capabilities of MapReduce. Users of SQLMR can write data management programs in a familiar query language or run existing programs without modification. SQLMR provides a compiler that translates a SQL program into a MapReduce program and executes it in a MapReduce system. To achieve high performance in data processing, we also devise a number of optimization techniques, including efficient data pre-processing, data partitioning, data indexing, query result caching, and optimization of the Hadoop runtime system. We conducted experiments using the widely used SysBench benchmark to evaluate both the data scalability and the system scalability of SQLMR, comparing SQLMR with MySQL, MySQL Cluster, and two MapReduce-based database systems, Hive and HadoopDB. Our experiment results demonstrate that SQLMR achieves significant improvement in query processing time, with improvement ratios of against HadoopDB and 1.41 against Hive for range SELECT queries, and against HadoopDB and 1.29 against Hive for JOIN queries. Our experiments also show that SQLMR scales well with increasing system size.

Fig. 16. Comparison of Data Preprocessing time between HadoopDB and SQLMR with small data sizes.

Fig. 17. Comparison of Data Preprocessing time between HadoopDB and SQLMR with large data sizes.

VII. ACKNOWLEDGEMENTS

This work is supported in part by the National Science Council of Taiwan under grant number NSC E. We would also like to thank the Academia Sinica Computing Center for providing computing and storage facilities.

REFERENCES

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ser. SOSP '03. New York, NY, USA: ACM, 2003.
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Trans.
Comput. Syst., vol. 26, pp. 4:1-4:26, June.
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation, ser. OSDI '04, vol. 6. Berkeley, CA, USA: USENIX Association, 2004.
[4] Hadoop,
[5] Amazon Simple Storage Service,
[6] Amazon SimpleDB,
[7] Microsoft SQL Azure,
[8] MongoDB,
[9] CouchDB,
[10] Cassandra,
[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, ser. SOSP '07. New York, NY, USA: ACM, 2007.
[12] Amazon Relational Database Service,
[13] Microsoft SQL Server,
[14] NCHC BigTable,
[15] MySQL database,
[16] MySQL Cluster,
[17] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: a not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '08. New York, NY, USA: ACM, 2008.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive: a warehousing solution over a map-reduce framework," Proc. VLDB Endow., vol. 2, August.
[19] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proc. VLDB Endow., vol. 2, August.
[20] Oracle database,
[21] L.-Y. Ho, J.-J. Wu, and P. Liu, "Optimal algorithms for cross-rack communication optimization in MapReduce framework," in press.
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
NetFlow Analysis with MapReduce
NetFlow Analysis with MapReduce Wonchul Kang, Yeonhee Lee, Youngseok Lee Chungnam National University {teshi85, yhlee06, lee}@cnu.ac.kr 2010.04.24(Sat) based on "An Internet Traffic Analysis Method with
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
