Report for the seminar Algorithms for Database Systems
F1: A Distributed SQL Database That Scales
Bogdan Aurel Vancea
May

1 Introduction

F1 [1] is a distributed relational database developed by Google, used mainly for the Google AdWords business. F1 combines the scalability of NoSQL systems with the consistency offered by SQL databases. The name F1 is an abbreviation of Filial 1 Hybrid, which in biology stands for the first generation of offspring of two very different parent species. The name is meant to symbolize that F1 is the result of combining NoSQL and SQL databases. One of the most important design aspects of F1 is that it is built on top of a distributed key-value store. This key-value store, called Spanner [2], is a NoSQL database created by Google that provides synchronous cross-datacenter replication and strong consistency. This design choice results in a relatively high commit latency for transactions, which F1 mitigates through various design heuristics. As a result, the latency of applications using F1 is similar to the latency of the previous database solution used by the Google AdWords product. However, F1 also provides better scalability, reliability and availability.

2 Goals

The goal of this report is to analyze the approach taken by F1 for designing scalable SQL databases. For the purpose of this report, a scalable SQL database is defined as a database that:

- provides strong consistency semantics. This means that the system should always present a consistent state. A strongly consistent state is the basis for ACID transactions; however, this report will not go into details about such transactions.
- is scalable both geographically and in terms of data storage and request load. The system should be able to scale transparently, just by adding additional nodes, in order to handle more data or an increasing number of requests per second.
The architecture proposed by F1 builds such a scalable SQL database by adding a layer of SQL processing over a distributed key-value store. The distributed key-value store fulfills the scalability and consistency requirements for simple operations on key-value pairs, while relational abstractions like tables, SQL processing and ACID transactions are implemented by an additional database middleware. The main issue that appears in such a system is the additional network latency added by the middleware server layer. Considering that distributed databases can be deployed over multiple data centers, the additional network latency could add a significant penalty to read and write latencies. And in an age of Big Data, databases with high latencies and potentially low throughput are not acceptable. In this context, this report will present the architecture of the multilayer system proposed above and analyze the impact of the additional layer on read and write latencies. Additionally, the report will present some solutions that mitigate the extra latency added by the additional layer.

3 System Architecture

Figure 1: The architecture of F1

This section details the distributed database architecture proposed by F1. The architecture of the system is presented in Figure 1. In this model, the data is stored on the distributed key-value store servers, while query processing is performed by the middleware servers. In the case of the F1 database, the key-value store servers are Spanner servers, while the middleware servers are F1 servers.
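As a concrete illustration of this layering, the sketch below models a middleware server that routes low-level key-value operations to the storage node owning each key, and treats a write as complete only once every contacted node has acknowledged it. All class names and interfaces here are invented for illustration; the actual F1/Spanner RPC interfaces are not described at this level of detail in the paper.

```python
class StorageNode:
    """A key-value store node holding one shard of the data."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return "ack"  # acknowledge the successful write

class MiddlewareServer:
    """Routes key-value operations to the node that owns each key."""
    def __init__(self, nodes):
        self.nodes = nodes

    def node_for(self, key):
        # toy placement rule: hash the key onto one of the nodes
        return self.nodes[hash(key) % len(self.nodes)]

    def write(self, ops):
        # the write counts as completed only after every affected
        # node has acknowledged its part of the operation
        acks = [self.node_for(k).apply(k, v) for k, v in ops]
        return all(a == "ack" for a in acks)

nodes = [StorageNode() for _ in range(3)]
middleware = MiddlewareServer(nodes)
middleware.write([("user/1/name", "Ada"), ("user/2/name", "Bob")])
```

The important property mirrored here is the separation of concerns: the middleware holds no relational data itself, so more `MiddlewareServer` instances can be added without any data movement.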
The Spanner key-value store servers use a distributed file system implemented by Google, called Colossus [3], the second generation of the Google File System [4]. Conceptually, the relational data is stored as rows in tables; however, at the level of the key-value store servers, each table row is stored as multiple key-value pairs. This implementation detail is abstracted by the middleware servers, which convert SQL queries operating on table rows into low-level operations on key-value pairs. The database is accessed using SQL queries sent through a client library to one of the middleware servers. The middleware servers process the SQL queries and produce a list of low-level operations on key-value pairs. These low-level operations are then forwarded by the middleware server to one or more of the key-value store nodes that hold the data affected by them. Because of the strong consistency requirement, a consistent view of the data must be kept at all times, and the middleware server can consider a write operation completed only after it has received an acknowledgement from the key-value store servers signifying that the write finished successfully. A common way to increase the availability and fault tolerance of a distributed database system is to replicate the stored data across multiple nodes. In such a case, to maintain the strong consistency requirement, the middleware node would also have to obtain an acknowledgement from all of the replicas of the node that holds the data to be written before considering the write completed. The main advantage of this multilayer architecture is that the data processing components are physically separated from the data storage components. Because the data is stored only on the key-value store nodes, the data storage capacity of the system can be scaled independently of the query processing capacity.
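The mapping from table rows to key-value pairs can be sketched as follows. The key format (`table/primary-key/column`) is an invented, simplified encoding for illustration; the report does not describe Spanner's actual on-disk key layout.

```python
def encode_row(table, primary_key, row):
    """Map one relational row to several key-value pairs:
    one pair per non-key column, keyed by table, primary key and column."""
    return {f"{table}/{primary_key}/{col}": val for col, val in row.items()}

def decode_rows(kv_pairs):
    """Reassemble relational rows from the flat key-value representation."""
    rows = {}
    for key, val in kv_pairs.items():
        table, pk, col = key.split("/")
        rows.setdefault((table, pk), {})[col] = val
    return rows

pairs = encode_row("Manufacturer", "m1", {"name": "Samsung", "country": "KR"})
# pairs == {"Manufacturer/m1/name": "Samsung", "Manufacturer/m1/country": "KR"}
```

A single-row SQL update thus becomes a handful of key-value writes, which is exactly the unit of work the middleware forwards to the storage nodes.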
Therefore, to increase the query processing power of the system, one only needs to add more middleware nodes. Because these middleware nodes do not store any relational data, this operation does not incur a data redistribution cost. To increase the amount of data that can be stored by the system, more key-value store nodes need to be added. It is important to note that adding new storage nodes brings a data redistribution cost. If the new node is a replica of an existing node, the new node needs to load the state of the node that it replicates. If the new node is a non-replica node, the existing data stored by the system needs to be redistributed among the nodes. The disadvantage of this architecture is that all data access operations need at least two network round-trips in addition to the disk operation. In the case of read operations, a network request is made between the client and the middleware server, followed by an additional network request made by the middleware to the key-value store server holding the requested data. In this case, having multiple replicas of key-value store nodes allows requests from different clients for the same data to be sent to different nodes, thus mitigating some of the extra latency. In the case of write operations, in addition to the network request between the middleware and the key-value store server, network requests need to be made to all of the replicas of the node storing the data to be updated. Unlike read operations, the presence of replica nodes influences the write latency negatively. The next section will analyze the impact of replication on write requests in a strongly consistent system.
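The asymmetry between reads and writes can be made explicit with a back-of-the-envelope latency model. The formulas follow the round-trip counting above; the concrete millisecond values in the example call are illustrative assumptions, not measurements from the paper.

```python
def read_latency(client_mw_rtt, mw_store_rtt, disk):
    # one round-trip client -> middleware, one middleware -> storage node,
    # plus the disk access itself
    return client_mw_rtt + mw_store_rtt + disk

def write_latency(client_mw_rtt, mw_store_rtt, replica_rtt, disk, rounds=2):
    # synchronous replication adds `rounds` extra round-trips between the
    # leader and its replicas (propose + commit, see Section 4), paid in
    # parallel across replicas but serially per round
    return client_mw_rtt + mw_store_rtt + rounds * replica_rtt + disk

# assumed numbers: 1 ms to the middleware, 5 ms to the storage node,
# 50 ms cross-datacenter replica round-trip, 10 ms disk access
read_latency(1, 5, 10)        # 16 ms
write_latency(1, 5, 50, 10)   # 116 ms
```

Under these assumptions a replicated write is roughly an order of magnitude slower than a read, which motivates the write-oriented optimizations in the following sections.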
4 Synchronous Replication

Figure 2: Replication of a write operation using the Paxos algorithm

There are multiple models for replicating writes to replicas, each ensuring a certain consistency model. The synchronous replication model ensures that all write requests are atomically performed on all replicas. In this replication model, the node that contains the main copy of the data is called the leader node, while the nodes containing copies of the data are simply called replicas. The replication process is initialized by the leader and is finished once all of the replicas have performed the write on the data. If multiple writes need to be replicated by a middleware server, it is possible to initialize the replication of each write on a different node, to increase parallelism. In such a case, a consensus algorithm needs to be used for replication. F1 uses the Paxos consensus algorithm for the replication process. This algorithm ensures that the replication will finish successfully even in the presence of multiple leaders. Figure 2 shows the requests made during a replication round of the Paxos algorithm. First, an SQL query is sent from the client to one of the middleware servers. This update query is then converted to a single key-value operation, which is sent to the key-value store servers. The best-case scenario for the algorithm is the following: the key-value store server receiving the update initializes the replication process and sends propose messages with the new value to the replicas. The replicas can accept the newly proposed value and send an acknowledge message back to the leader. The leader counts the acknowledge messages from the replicas as votes. If a majority of the replicas have accepted the proposed update, the leader can send a commit message to the replicas. Only after the commit has been performed by all of the replicas is the replication process finished.
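The best-case round described above can be sketched as a single propose/commit exchange. This is a simplified, Paxos-style majority vote, not a full Paxos implementation (it omits ballot numbers, leader election and recovery), and the function names are invented for illustration.

```python
def replicate(value, replicas, accept):
    """One best-case replication round: the leader proposes `value` to all
    replicas, counts acknowledgements as votes, and commits only if a
    majority accepted. `accept(replica, value)` models a replica's vote."""
    acks = [r for r in replicas if accept(r, value)]  # propose phase
    if len(acks) > len(replicas) // 2:
        # commit phase: in this best-case sketch the commit message
        # is delivered to every replica
        committed = {r: value for r in replicas}
        return True, committed
    return False, {}

# one replica is unreachable, but a 2-of-3 majority still commits
ok, state = replicate("v1", ["r1", "r2", "r3"], lambda r, v: r != "r3")
```

Note that even in this best case the leader pays two round-trips per replica (propose, then commit), which is the write-latency penalty the report is concerned with.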
Figure 3: A possible normalized relational schema for mobile manufacturers

If multiple updates are initialized at the same time, a single leader is chosen to perform all the updates in a valid order. This case will not be covered in this report. However, one can see that even in the simplest case, when a single write is replicated successfully, the algorithm needs two network round-trips between the leader and each replica: one for the propose message and another for the commit message. These additional network round-trips increase the latency of write operations. There are several ways in which a high write latency can be mitigated in a database system. The following sections analyze the optimizations proposed by F1 to deal with the high write latency.

5 Data Model

F1 proposes using a hierarchical data model to reduce the number of writes required for update operations. The data model used by F1 is very similar to the data model used by modern relational databases. F1 stores data as rows in tables; however, the internal storage is slightly different from that of traditional databases. F1 provides some extensions to the traditional data model: explicit table hierarchies and column support for Protocol Buffers. From the logical point of view, in the clustered hierarchical model the tables are organized in a hierarchy. In this hierarchy, each table can be a parent table of one or more child tables. A table that has no parent is called a root table. From the physical point of view, all of the child tables are stored clustered with their parent tables. This means that the rows of the parent and child tables are interleaved. The remainder of this section presents the differences between the traditional,
normalized relational model and the hierarchical clustered schema model proposed by F1, using as an example a database that holds data about mobile manufacturers. An example relational schema for a mobile manufacturer database is illustrated in Figure 3. There are 4 tables: Manufacturer, Phone, Tablet and an additional SIM Support table that tracks the types of SIM cards that can be associated with each mobile phone. In this traditional, normalized relational model, all rows that belong to the same table are usually stored in the same file on disk. Figure 4 shows an example table hierarchy for the mobile database. In this hierarchy, Manufacturer is the root table, while the Tablet and Phone tables are its child tables. The SIM Support table is a child table of the Phone table, as the SIM information is only related to phones.

Figure 4: A possible hierarchy for a mobile manufacturer database

The storage of rows on disk for a clustered hierarchical schema is different from the storage layout of the relational schema. While in the relational schema the rows of each table are stored one after the other on disk, in the clustered hierarchical schema the child rows are stored interleaved with the parent rows. The storage layout for the mobile manufacturer hierarchy is shown in Figure 5. In the example, the rows of the Phone and Tablet tables are stored right after the rows of the corresponding Manufacturer entries, and the rows of the SIM Support table are stored right after the corresponding rows from the Phone table. An additional storage constraint set by the hierarchical clustered schema is that all the rows associated with a root row must be stored on the same node. This includes not only the direct children, but also the children's children and so on. The main advantage of such a hierarchical schema is that all the rows belonging to a
single root row are accessible using a range scan starting from that root row. For example, updating attributes belonging to all tablets or phones manufactured by Samsung can be done in a single scan starting from the manufacturer row of Samsung. Because of the constraint that all child rows of a root row are stored on the same node, if a transaction needs to apply multiple updates on a root row hierarchy, all the writes are directed to a single node. This is important because multiple updates corresponding to the same transaction can be batched in a single network message. This request batching is described in the following section. The disadvantage of this hierarchical model is that the domain data needs to exhibit a certain hierarchy. If the tables cannot be grouped into such hierarchies, the schema degenerates into a traditional normalized schema, where all the tables are root tables and no table has any child tables. Such a schema cannot benefit from the advantages of a hierarchical schema. Moreover, the fact that all of the child rows of a root row need to be stored on the same node limits the maximum size of a root row hierarchy to the storage space available on a single node. This could pose a problem for hierarchies in which root rows have very many child rows.

Figure 5: The storage layout for the hierarchical schema for mobile manufacturers

6 Request Batching

This section details the request batching proposed by F1 to mitigate the high write latency. In traditional SQL databases, where data storage and processing are done on the same node, the write latency is mainly caused by disk latency. The disk latency, in turn, is determined by the write capacity of the device and the contention of database processes for the IO device. In the case of the multilayer architecture, the duration of network messages makes up an important component of the write latency. This means
that the write latency can be mitigated by batching multiple write commands in a single network message. For example, if a transaction contains multiple write operations applied to data belonging to a single root row, all of these write commands can be batched in a single network message from the query processing node to the key-value store node. Moreover, this batch of updates can be replicated at the same time. Another example is updates applied to different root rows that reside on the same node. This case is illustrated in Figure 6, where 2 separate updates need to be applied to root rows stored on the same node. The SQL updates are translated into 2 write commands operating on key-value pairs, and these 2 commands can be batched in a single network message sent from the F1 middleware server to the key-value store server.

Figure 6: Illustration of the request batching process

7 Drawbacks and Alternatives

The system proposed in the F1 paper manages to provide both scalability and strong consistency. However, this comes at a certain cost:

- Higher single read and write latencies. In this system, read and write commands operating on single rows have a high latency. The authors report that the latency of the system for these operations is larger than the latency of the previous database system used for AdWords. However, in the proposed system, reads or writes to the full row hierarchy of a single root row can also be done with a single network request.
- Higher resource cost. This architecture requires more physical nodes, because SQL query processing and data storage are done on different machines. This means that at least one middleware node is needed for query processing, without taking into account the key-value store nodes.
- Need for hierarchical structure in the data. The clustered hierarchical data model is a key concept used to reduce the number of write requests associated with each transaction. If the data stored by the database cannot be grouped into an appropriate hierarchy, the reduced latency offered by this storage optimization will not be achieved.

In the architecture proposed by F1, the data is remote from the nodes that perform the query processing. An alternative architecture is to keep the data on the same nodes that perform the query processing. The authors of [5] have identified the main bottlenecks of traditional relational databases to be write-ahead logging, two-phase locking, data structure latching and buffer management. The database VoltDB [6] was implemented in the spirit of these ideas and proposes a distributed in-memory architecture. In this system, nodes are single-threaded, eliminating the need for locking and latching, while the fully in-memory architecture simplifies buffer management.

8 Conclusions

This report has analyzed F1, a distributed SQL database. The authors successfully combine the advantages of SQL and NoSQL systems in a system that provides transparent scalability, strong consistency and very high availability. This is done using a multilayer architecture, in which the query processing components are physically separated from the data storage components. Such an architecture provides good scalability and availability, but the additional physical layer impacts the write latency negatively. This additional network latency is mitigated using a clustered hierarchical schema instead of a traditional relational schema. Request batching is also used to group multiple commands into a single network request in order to reduce the impact of network latency.
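To make the clustered hierarchical schema concrete, the sketch below stores each row under a key prefixed by its root row, so that a root row and all of its descendants occupy a contiguous key range and can be fetched with a single range scan. The '/'-separated key format and the class name are invented for illustration; F1's actual interleaved key encoding is more involved.

```python
import bisect

class HierarchicalStore:
    """Rows keyed by a '/'-separated path starting at their root row,
    kept in sorted key order so that a root row and all of its
    descendant rows form one contiguous key range. Simplifying
    assumption: no unrelated key shares a root's name as a prefix."""
    def __init__(self):
        self.keys, self.rows = [], {}

    def put(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def scan_root(self, root):
        # a single range scan: from the root row itself up to the last
        # key that still lies under the root's '/' prefix
        lo = bisect.bisect_left(self.keys, root)
        hi = bisect.bisect_right(self.keys, root + "/\xff")
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

store = HierarchicalStore()
for key in ["Apple", "Samsung", "Samsung/Phone/GalaxyS4",
            "Samsung/Phone/GalaxyS4/SIM/micro", "Samsung/Tablet/GalaxyTab"]:
    store.put(key, {"key": key})
samsung_rows = store.scan_root("Samsung")  # root row plus 3 descendants
```

Because the whole `Samsung` hierarchy is contiguous (and, in F1, stored on a single node), a transaction touching only this hierarchy can send all of its writes in one batched network message, which is precisely the combination of the Section 5 and Section 6 optimizations.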
The authors report that the system has been successfully used in production and that the user-facing latency of their application is on par with the latency of the previous database system.

References

[1] Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. F1: A distributed SQL database that scales. In VLDB, 2013.

[2] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst., 31(3):8, 2013.
[3] Andrew Fikes. Storage architecture and challenges. googleusercontent.com/media/research.google.com/en//university/relations/facultysummit2010/storage_architecture_and_challenges.pdf, July 2010.

[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29-43, October 2003.

[5] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. OLTP Through the Looking Glass, and What We Found There. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, New York, NY, USA, 2008. ACM.

[6] Michael Stonebraker and Ariel Weisberg. The VoltDB main memory DBMS.
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationAmerican International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationTECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED DATABASES
Constantin Brâncuşi University of Târgu Jiu ENGINEERING FACULTY SCIENTIFIC CONFERENCE 13 th edition with international participation November 07-08, 2008 Târgu Jiu TECHNIQUES FOR DATA REPLICATION ON DISTRIBUTED
More informationHow To Write A Database Program
SQL, NoSQL, and Next Generation DBMSs Shahram Ghandeharizadeh Director of the USC Database Lab Outline A brief history of DBMSs. OSs SQL NoSQL 1960/70 1980+ 2000+ Before Computers Database DBMS/Data Store
More informationA Brief Analysis on Architecture and Reliability of Cloud Based Data Storage
Volume 2, No.4, July August 2013 International Journal of Information Systems and Computer Sciences ISSN 2319 7595 Tejaswini S L Jayanthy et al., Available International Online Journal at http://warse.org/pdfs/ijiscs03242013.pdf
More informationSAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
More information16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationOne-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone. Michael Stonebraker December, 2008
One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone Michael Stonebraker December, 2008 DBMS Vendors (The Elephants) Sell One Size Fits All (OSFA) It s too hard for them to maintain multiple code
More informationExploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationMS-40074: Microsoft SQL Server 2014 for Oracle DBAs
MS-40074: Microsoft SQL Server 2014 for Oracle DBAs Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills and experience as an Oracle
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationBig Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationIn-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki 30.11.2012
In-Memory Columnar Databases HyPer Arto Kärki University of Helsinki 30.11.2012 1 Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 2 Introduction The relational
More informationTier Architectures. Kathleen Durant CS 3200
Tier Architectures Kathleen Durant CS 3200 1 Supporting Architectures for DBMS Over the years there have been many different hardware configurations to support database systems Some are outdated others
More informationSCHEDULING IN CLOUD COMPUTING
SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism
More informationAffordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale
WHITE PAPER Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale Sponsored by: IBM Carl W. Olofson December 2014 IN THIS WHITE PAPER This white paper discusses the concept
More informationConcepts of Database Management Seventh Edition. Chapter 7 DBMS Functions
Concepts of Database Management Seventh Edition Chapter 7 DBMS Functions Objectives Introduce the functions, or services, provided by a DBMS Describe how a DBMS handles updating and retrieving data Examine
More informationMicrosoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led
Microsoft SQL Server for Oracle DBAs Course 40045; 4 Days, Instructor-led Course Description This four-day instructor-led course provides students with the knowledge and skills to capitalize on their skills
More informationComparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationMassive Data Storage
Massive Data Storage Storage on the "Cloud" and the Google File System paper by: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presentation by: Joshua Michalczak COP 4810 - Topics in Computer Science
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationCHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL
CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter
More informationPractical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00
Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn
More informationFuture Prospects of Scalable Cloud Computing
Future Prospects of Scalable Cloud Computing Keijo Heljanko Department of Information and Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 7.3-2012 1/17 Future Cloud Topics Beyond
More informationParallel & Distributed Data Management
Parallel & Distributed Data Management Kai Shen Data Management Data management Efficiency: fast reads/writes Durability and consistency: data is safe and sound despite failures Usability: convenient interfaces
More informationLogistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.
Database Management Systems Chapter 1 Mirek Riedewald Many slides based on textbook slides by Ramakrishnan and Gehrke 1 Logistics Go to http://www.ccs.neu.edu/~mirek/classes/2010-f- CS3200 for all course-related
More informationImprove Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database
WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationHigh-performance metadata indexing and search in petascale data storage systems
High-performance metadata indexing and search in petascale data storage systems A W Leung, M Shao, T Bisson, S Pasupathy and E L Miller Storage Systems Research Center, University of California, Santa
More informationA Distribution Management System for Relational Databases in Cloud Environments
JOURNAL OF ELECTRONIC SCIENCE AND TECHNOLOGY, VOL. 11, NO. 2, JUNE 2013 169 A Distribution Management System for Relational Databases in Cloud Environments Sze-Yao Li, Chun-Ming Chang, Yuan-Yu Tsai, Seth
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationSQL Server 2014 New Features/In- Memory Store. Juergen Thomas Microsoft Corporation
SQL Server 2014 New Features/In- Memory Store Juergen Thomas Microsoft Corporation AGENDA 1. SQL Server 2014 what and when 2. SQL Server 2014 In-Memory 3. SQL Server 2014 in IaaS scenarios 2 SQL Server
More informationDistributed Systems. Tutorial 12 Cassandra
Distributed Systems Tutorial 12 Cassandra written by Alex Libov Based on FOSDEM 2010 presentation winter semester, 2013-2014 Cassandra In Greek mythology, Cassandra had the power of prophecy and the curse
More informationHow to Build a High-Performance Data Warehouse By David J. DeWitt, Ph.D.; Samuel Madden, Ph.D.; and Michael Stonebraker, Ph.D.
1 How To Build a High-Performance Data Warehouse How to Build a High-Performance Data Warehouse By David J. DeWitt, Ph.D.; Samuel Madden, Ph.D.; and Michael Stonebraker, Ph.D. Over the last decade, the
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationAzure Scalability Prescriptive Architecture using the Enzo Multitenant Framework
Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Many corporations and Independent Software Vendors considering cloud computing adoption face a similar challenge: how should
More informationChapter 3. Database Environment - Objectives. Multi-user DBMS Architectures. Teleprocessing. File-Server
Chapter 3 Database Architectures and the Web Transparencies Database Environment - Objectives The meaning of the client server architecture and the advantages of this type of architecture for a DBMS. The
More informationWITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE
WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE 1 W W W. F U S I ON I O.COM Table of Contents Table of Contents... 2 Executive Summary... 3 Introduction: In-Memory Meets iomemory... 4 What
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More informationA B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION
Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com
More informationHow In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
More informationStay Tuned for Today s Session! NAVIGATING THE DATABASE UNIVERSE"
Stay Tuned for Today s Session! NAVIGATING THE DATABASE UNIVERSE" Dr. Michael Stonebraker and Scott Jarr! Navigating the Database Universe" A Few Housekeeping Items! Remember to mute your line! Type your
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
More informationextensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
More informationBasics Of Replication: SQL Server 2000
Basics Of Replication: SQL Server 2000 Table of Contents: Replication: SQL Server 2000 - Part 1 Replication Benefits SQL Server Platform for Replication Entities for the SQL Server Replication Model Entities
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More information