Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems
FEATURE: BIG DATA SOFTWARE SYSTEMS

Ian Gorton and John Klein, Software Engineering Institute

// Big data systems present many challenges to software architects. In particular, distributed software architectures become tightly coupled to data and deployment architectures. This causes a consolidation of concerns; designs must be harmonized across these three architectures to satisfy quality requirements. //

THE EXPONENTIAL GROWTH of data over the last decade has fueled a new specialization for software technology: data-intensive, or big data, software systems. [1] Internet-born organizations such as Google and Amazon are on this revolution's cutting edge, collecting, managing, storing, and analyzing some of the largest data repositories ever constructed. Their pioneering efforts, [2,3] along with those of numerous other big data innovators, have provided a variety of open source and commercial data management technologies that let any organization construct and operate massively scalable, highly available data repositories.

Addressing the challenges of software for big data systems requires careful design tradeoffs spanning the distributed software, data, and deployment architectures. It also requires extending traditional software architecture design knowledge to account for the tight coupling that exists in scalable systems. Scale drives a consolidation of concerns, so that distribution, data, and deployment architectural qualities can no longer be effectively considered separately. To illustrate this, we'll use an example from our current research in healthcare informatics.

The Challenges of Big Data

Data-intensive systems have long been built on SQL database technology, which relies primarily on vertical scaling (faster processors and bigger disks) as workload or storage requirements increase. SQL databases' inherent vertical-scaling limitations [4] have led to new products that relax many core tenets of relational databases. Strictly defined normalized data models, strong data consistency guarantees, and the SQL standard have been replaced by schemaless and intentionally denormalized data models, weak consistency, and proprietary APIs that expose the underlying data management mechanisms to the programmer. These NoSQL products [4] typically are designed to scale horizontally across clusters of low-cost, moderate-performance servers. They achieve high performance, elastic storage capacity, and availability by partitioning and replicating datasets across a cluster. Prominent examples of NoSQL databases include Cassandra, Riak, and MongoDB (see the sidebar "NoSQL Databases").
NOSQL DATABASES

The rise of big data applications has caused a significant flux in database technologies. While mature relational database technologies continue to evolve, a spectrum of databases called NoSQL has emerged over the past decade. The relational model imposes a strict schema, which inhibits data evolution and causes difficulties scaling across clusters. In response, NoSQL databases have adopted simpler data models. Common features include schemaless records, allowing data models to evolve dynamically, and horizontal scaling, by sharding (partitioning and distributing) and replicating data collections across large clusters. Figure A illustrates the four most prominent data models, whose characteristics we summarize here. More comprehensive information is at nosql-database.org.

Document databases (see Figure A1) store collections of objects, typically encoded using JSON (JavaScript Object Notation) or XML. Documents have keys, and you can build secondary indexes on nonkey fields. Document formats are self-describing; a collection might include documents with different formats. Leading examples are MongoDB (www.mongodb.org) and CouchDB (couchdb.apache.org).

Key-value databases (see Figure A2) implement a distributed hash map. Records can be accessed primarily through key searches, and the value associated with each key is treated as opaque, requiring reader interpretation. This simple model facilitates sharding and replication to create highly scalable and available systems. Examples are Riak (basho.com) and DynamoDB (aws.amazon.com/dynamodb).

Column-oriented databases (see Figure A3) extend the key-value model by organizing keyed records as a collection of columns, where a column is a key-value pair. The key becomes the column name; the value can be an arbitrary data type such as a JSON document or binary image. A collection can contain records with different numbers of columns. Examples are HBase (hbase.apache.org) and Cassandra (cassandra.apache.org).

Graph databases (see Figure A4) organize data in a highly connected structure, typically some form of directed graph. They can provide exceptional performance for problems involving graph traversals and subgraph matching. Because efficient graph partitioning is an NP-hard problem, these databases tend to be less concerned with horizontal scaling and commonly offer ACID (atomicity, consistency, isolation, durability) transactions to provide strong consistency. Examples include Neo4j (www.neo4j.org) and GraphBase (graphbase.net).

NoSQL technologies have many implications for application design. Because there's no equivalent of SQL, each technology supports its own query mechanism. These mechanisms typically make the application programmer responsible for explicitly formulating query executions, rather than relying on query planners that execute queries based on declarative specifications. The programmer is also responsible for combining results from different data collections. This lack of the ability to perform JOINs forces extensive denormalization of data models so that JOIN-style queries can be efficiently executed by accessing a single data collection. When databases are sharded and replicated, the programmer also must manage consistency when concurrent updates occur and must design applications to tolerate stale data due to latency in update replication.
[FIGURE A. Four major NoSQL data models: (1) a document store, (2) a key-value store, (3) a column store, and (4) a graph store. Each panel shows the same two employee records (John, employed by SEI; Ian, employed by SEI and previously by PNNL) expressed in that model.]
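To make the sidebar's four data models concrete, here is a minimal sketch (plain Python, no database required, not from the article) of Figure A's two employee records expressed in each style; all field and variable names are illustrative.

```python
import json

# (1) Document store: self-describing JSON-like objects; documents in the
# same collection may have different fields (Ian has "Previous", John doesn't).
documents = [
    {"id": 1, "Name": "John", "Employer": "SEI"},
    {"id": 2, "Name": "Ian", "Employer": "SEI", "Previous": "PNNL"},
]

# (2) Key-value store: a distributed hash map; each value is an opaque
# blob (here a JSON string) that the reader must interpret.
key_value = {
    1: json.dumps({"Name": "John", "Employer": "SEI"}),
    2: json.dumps({"Name": "Ian", "Employer": "SEI", "Previous": "PNNL"}),
}

# (3) Column store: each row key maps to named columns; rows may carry
# different numbers of columns.
columns = {
    "row:1": {"Name": "John", "Employer": "SEI"},
    "row:2": {"Name": "Ian", "Employer": "SEI", "Previous": "PNNL"},
}

# (4) Graph store: nodes plus labeled edges; relationships are explicit,
# which favors traversal queries.
nodes = {1: {"Name": "John"}, 2: {"Name": "Ian"},
         "SEI": {"Name": "SEI"}, "PNNL": {"Name": "PNNL"}}
edges = [
    (1, "is employed by", "SEI"),
    (2, "is employed by", "SEI"),
    (2, "previously employed by", "PNNL"),
]
```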
Distributed databases have fundamental quality constraints, defined by Eric Brewer's CAP (consistency, availability, partition tolerance) theorem. [5] When a network partition occurs (causing arbitrary message loss between cluster nodes), a system must trade consistency (all readers see the same data) against availability (every request receives a success or failure response). Daniel Abadi's PACELC provides a practical interpretation of this theorem. [6] If a partition (P) occurs, a system must trade availability (A) against consistency (C). Else (E), in the usual case of no partition, a system must trade latency (L) against consistency (C); the sketch at the end of this section makes that tradeoff concrete.

Additional design challenges for scalable data-intensive systems stem from the following three issues. First, achieving high scalability and availability leads to highly distributed systems. Distribution occurs in all tiers, from webserver farms and caches to back-end storage. Second, the abstraction of a single system image, with transactional writes and consistent reads using SQL-like query languages, is difficult to achieve at scale. [7] Applications must be aware of data replicas; handle inconsistencies from conflicting replica updates; and continue operation in spite of inevitable processor, network, and software failures. Third, each NoSQL product embodies a specific set of quality attribute tradeoffs, especially in terms of performance, scalability, durability, and consistency. Architects must diligently evaluate candidate database technologies and select databases that can satisfy application requirements. This often leads to polyglot persistence: using different database technologies to store different datasets in a single system to meet quality attribute requirements. [8]

Furthermore, as data volumes grow to petascale and beyond, the required hardware resources grow from hundreds to tens of thousands of servers. At this deployment scale, many widely used software architecture patterns are unsuitable. Architectural and algorithmic approaches that are sensitive to hardware resource use can significantly reduce overall costs. For more on this, see the sidebar "Why Scale Matters."
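The article doesn't prescribe a mechanism for the latency-consistency tradeoff; quorum-based replication is one common way systems expose it, sketched here under assumed parameters. With N replicas, a write waits for W acknowledgments and a read consults R replicas; if R + W > N, every read quorum overlaps every write quorum, so reads see the latest write. Smaller R and W reduce latency but allow stale reads.

```python
# Illustrative quorum rule only; real systems add versioning and repair.

def is_strongly_consistent(n_replicas: int, w: int, r: int) -> bool:
    """True when read and write quorums must overlap, so a read
    always touches at least one replica holding the newest value."""
    return r + w > n_replicas

# With N = 3 replicas:
print(is_strongly_consistent(3, w=2, r=2))  # True: consistent, higher latency
print(is_strongly_consistent(3, w=1, r=1))  # False: fast, but reads may be stale
```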
Big Data Application Characteristics

Big data applications are rapidly becoming pervasive across a range of business domains. One example domain in which big data analytics looms prominently on the horizon is aeronautics. Modern commercial airliners produce approximately 0.5 Tbytes of operational data per flight. [9] This data can be used to diagnose faults in real time, optimize fuel consumption, and predict maintenance needs. Airlines must build scalable systems to capture, manage, and analyze this data to improve reliability and reduce costs.

Another domain is healthcare. Big data analytics in US healthcare could save an estimated $450 billion. [10] Analysis of petabytes of data across patient populations, taken from diverse sources such as insurance payers, public health entities, and clinical studies, can reduce costs by improving patient outcomes. In addition, operational efficiencies can extract new insights for disease treatment and prevention.

Across these and many other domains, big data systems share four requirements that drive the design of suitable software solutions. Collectively, these requirements represent a significant departure from traditional business systems, which are relatively well constrained in terms of data growth, analytics, and scale.

First, from social media sites to high-resolution sensor data collection in the power grid, big data systems must be able to sustain write-heavy workloads. [1] Because writes are costlier than reads, systems can use data sharding (partitioning and distribution) to spread write operations across disks and can use replication to provide high availability; a common sharding scheme is sketched below. Sharding and replication introduce availability and consistency issues that the systems must address.

The second requirement is to deal with variable request loads. Business and government systems experience highly variable workloads for reasons including product promotions, emergencies, and statutory deadlines such as tax submissions. To avoid the costs of overprovisioning to handle these occasional spikes, cloud platforms are elastic, letting applications add processing capacity to share loads and release resources when loads drop. Effectively exploiting this deployment mechanism requires architectures with application-specific strategies to detect increased loads, rapidly add new resources, and release them as necessary.
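The article doesn't mandate a particular sharding scheme; consistent hashing is one widely used approach, sketched below under illustrative names. It maps each record key to a position on a hash ring, so adding or removing a server (as an elastic deployment does) remaps only the keys in the affected arc of the ring rather than the whole dataset.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable, well-distributed placement on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring for spreading writes across shards."""

    def __init__(self, nodes, vnodes=64):
        # vnodes: virtual nodes per server, to smooth the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the server responsible for this record key."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("patient:12345"))  # deterministic shard assignment
```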
WHY SCALE MATTERS

Scale has many implications for software architecture; here we look at several.

The first implication focuses on how scale changes our designs' problem space. Big data systems are inherently distributed. Their architectures must explicitly handle partial failures, communication latencies, concurrency, consistency, and replication. As systems grow to thousands of processing nodes and disks and become geographically distributed, these issues are exacerbated as the probability of a hardware failure increases. One study found that 8 percent of servers in a typical datacenter experience a hardware problem annually, with disk failure most common. [1] Applications must also deal with unpredictable communication latencies and network connection failures. Scalable applications must treat failures as common events that are handled gracefully to ensure uninterrupted operation.

To deal with these issues, resilient architectures must fulfill two requirements. First, they must replicate data to ensure availability in the case of a disk failure or network partition. Replicas must be kept strictly or eventually consistent, using either master-slave or multimaster protocols. The latter need mechanisms such as Lamport clocks [2] to resolve inconsistencies due to concurrent writes. Second, architecture components must be stateless, replicated, and tolerant of failures of dependent services. For example, by using the Circuit Breaker pattern [3] and returning cached or default results whenever failures are detected, an architecture limits failures and allows time for recovery. (A minimal circuit breaker is sketched after this sidebar.)

Another implication is economics based. At very large scales, small optimizations in resource use can lead to very large cost reductions in absolute terms. Big data systems can use many thousands of servers and disks. Whether these are capital purchases or rented from a service provider, they remain a major cost and hence a target for reduction. Elasticity can reduce resource use by dynamically deploying new servers as the load increases and releasing them as the load decreases. This requires servers that boot and initialize quickly and application-specific strategies to avoid premature resource release. Other strategies target the development tool chain to maintain developer productivity while decreasing resource use. For example, Facebook built HipHop, a PHP-to-C++ transformation engine that reduced the CPU load for serving Web pages by 50 percent. [4] At the scale of Facebook's deployment, this creates significant operational-cost savings. Other targets for reduction are software license costs, which can be prohibitive at scale. This has led some organizations to create custom database and middleware technologies, many of which have been released as open source. Leading examples of technologies for big data systems are from Netflix (netflix.github.io) and LinkedIn.

Other implications of scale include testing and fault diagnosis. Owing to these systems' deployment footprints and the massive datasets they manage, comprehensively validating code before deployment to production can be impossible. Canary testing and Netflix's Simian Army are examples of the state of the art in testing at scale. [5] When problems occur in production, advanced monitoring and logging are needed for rapid diagnosis. In large-scale systems, log collection and analysis itself quickly becomes a big data problem. Solutions must include a low-overhead, scalable logging infrastructure such as Blitz4j. [6]

References

1. K.V. Vishwanath and N. Nagappan, "Characterizing Cloud Computing Hardware Reliability," Proc. 1st ACM Symp. Cloud Computing (SoCC 10), 2010, pp. 193-204.
2. L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, 1978, pp. 558-565.
3. M.T. Nygard, Release It! Design and Deploy Production-Ready Software, Pragmatic Bookshelf, 2007.
4. H. Zhao, "HipHop for PHP: Move Fast," blog, 2 Feb. 2010; developers.facebook.com/blog/post/2010/02/02/hiphop-for-php--move-fast.
5. B. Schmaus, "Deploying the Netflix API," blog, 14 Aug. 2013; techblog.netflix.com/2013/08/deploying-netflix-api.html.
6. K. Ranganathan, "Announcing Blitz4j: A Scalable Logging Framework," blog, 20 Nov. 2012.
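The sidebar cites the Circuit Breaker pattern without showing it. Here is a minimal sketch (illustrative, not the article's or Nygard's code) of a breaker that trips after repeated failures and serves a cached or default result while the dependent service recovers.

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures, fail fast with the
    fallback until reset_after seconds pass, giving the failing
    dependency time to recover."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self._call = call            # the protected remote operation
        self._fallback = fallback    # returns a cached or default result
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def request(self, *args, **kwargs):
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self._reset_after:
                return self._fallback(*args, **kwargs)   # open: fail fast
            self._opened_at = None                       # half-open: retry once
        try:
            result = self._call(*args, **kwargs)
            self._failures = 0                           # success: reset count
            return result
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = time.monotonic()       # trip the breaker
            return self._fallback(*args, **kwargs)
```

Wrapping a flaky service call is one line: `breaker = CircuitBreaker(fetch_profile, lambda uid: CACHED_PROFILES.get(uid))`, after which callers use `breaker.request(uid)` instead of calling the service directly.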
The third requirement is to support computation-intensive analytics. Most big data systems must support diverse query workloads, mixing requests that require rapid responses with long-running requests that perform complex analytics on significant portions of the data collection. This leads to software and data architectures explicitly structured to meet these varying latency demands. Netflix's Recommendations Engine is a pioneering example of how to design software and data architectures to partition simultaneously between low-latency requests and requests for advanced analytics on large data collections, to continually enhance personalized recommendations quality. [11]
The fourth requirement is high availability. With thousands of nodes in a horizontally scaled deployment, hardware and network failures inevitably occur. So, distributed software and data architectures must be resilient. Common approaches for high availability include replicating data across geographical regions, [12] stateless services, and application-specific mechanisms to provide degraded service in the face of failures.

These requirements' solutions crosscut the distribution, data, and deployment architectures. For example, elasticity requires processing capacity that can be acquired from the execution platform on demand, policies and mechanisms to appropriately start and stop services as the application load varies, and a database architecture that can reliably satisfy queries under an increased load. This coupling of architectures to satisfy a particular quality attribute is common in big data applications. It can be regarded as a tight coupling of the process, logical, and physical views in the 4+1 view model. [13]

An Example of the Consolidation of Concerns

At the Software Engineering Institute, we're helping to evolve a system to aggregate data from multiple petascale medical-record databases for clinical applications. To attain high scalability and availability at low cost, we're investigating using NoSQL databases for this aggregation. The design uses geographically distributed datacenters to increase availability and reduce latency for globally distributed users.

Consider the consistency requirements for two categories of data in this system: patient demographics (for example, name and insurance provider) and diagnostic-test results (for example, blood or imaging test results). Patient demographic records are updated infrequently. These updates must be immediately visible at the local site where the data was modified ("read your writes"), but a delay is acceptable before the update is visible at other sites ("eventual replica consistency"). In contrast, diagnostic-test results are updated more frequently, and changes must be immediately visible everywhere to support telemedicine and remote consultations with specialists ("strong replica consistency").

We're prototyping solutions using several NoSQL databases. We focus here on one prototype using MongoDB to illustrate the architecture drivers and design decisions. The design segments data across three shards and replicates data across two datacenters (see Figure 1).

[FIGURE 1. A MongoDB-based healthcare data management prototype. Web or application servers in each datacenter access sharded patient data and test-results data; data is replicated over a global network between geographically distributed datacenters, increasing availability and reducing latency for globally distributed users.]

MongoDB enforces a master-slave architecture; every data collection has a master replica that serves all write requests and propagates changes to other replicas. Clients can read from any replica, opening an inconsistency window between writes to the master and reads from other replicas. MongoDB allows tradeoffs between consistency and latency through parameter options on each write and read. A write can be unacknowledged (no assurance of durability, low latency), durable on the master replica, or durable on the master and one or more replicas (consistent, high latency).
A read can prefer the closest replica (potentially inconsistent, low latency), be restricted to the master replica (consistent, partition intolerant), or require most replicas to agree on the data value to be read (consistent, partition tolerant). The application developer must choose appropriate write and read options to achieve the desired performance, consistency, and durability and must handle partition errors to achieve the desired availability.

In our example, patient demographic data writes must be durable on the primary replica, but reads can be directed to the closest replica for low latency. This makes patient demographic reads insensitive to network partitions, at the cost of potentially inconsistent responses. In contrast, writes for diagnostic-test results must be durable on all replicas. Reads can be performed from the closest replica because the write ensures that all replicas are consistent. This means writes must handle failures caused by network partitions, whereas reads are insensitive to partitions. (A sketch of these per-collection choices in code follows below.)
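A minimal sketch of how these choices might look with pymongo, MongoDB's Python driver. The connection string, database, and collection names are assumptions for illustration; "durable on all replicas" is expressed here as w=3 on the assumption of a three-member replica set per shard, and this is not the article's code.

```python
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://example-cluster:27017/")  # hypothetical host
db = client.get_database("healthcare")

# Patient demographics: writes durable on the primary only (w=1); reads go
# to the nearest replica for low latency, accepting potentially stale
# responses during replication lag or partitions.
demographics = db.get_collection(
    "patient_demographics",
    write_concern=WriteConcern(w=1),
    read_preference=ReadPreference.NEAREST,
)

# Diagnostic-test results: writes wait for all three replicas (w=3,
# assuming a three-member replica set), so any replica can then serve a
# consistent read from the nearest site.
test_results = db.get_collection(
    "test_results",
    write_concern=WriteConcern(w=3, wtimeout=5000),  # ms before giving up
    read_preference=ReadPreference.NEAREST,
)

demographics.insert_one({"patient_id": 1, "name": "John"})

# If a partition prevents replication, this write times out with an error,
# which the application must handle to preserve availability.
test_results.insert_one({"patient_id": 1, "test": "CBC", "result": "normal"})
```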
Today, our healthcare informatics application runs atop an SQL database, which hides the physical data model and deployment topology from developers. SQL databases provide a single-system-image abstraction, which separates concerns between the application and database by hiding the details of data distribution across processors, storage, and networks behind a transactional read/write interface. [14] In shifting to a NoSQL environment, an application must directly handle the faults that depend on the physical data distribution (sharding and replication) and the number of replica sites and servers. These low-level infrastructure concerns, traditionally hidden under the database interface, must now be explicitly handled in application logic.

Systematic Design Using Tactics

In designing an architecture to satisfy quality drivers such as those in this healthcare example, one proven approach is to systematically select and apply a sequence of architecture tactics. [15] Tactics are elemental design decisions, embodying architectural knowledge of how to satisfy one design concern of a quality attribute. Tactic catalogs enable reuse of this knowledge. However, existing catalogs don't contain tactics specific to big data systems. Figures 2 and 3 extend the basic performance and availability tactics [15] to big data systems. Figure 4 defines scalability tactics, focusing on the design concern of increased workloads. Each figure shows how the design decisions span the data, distribution, and deployment architectures.

[FIGURE 2. Performance tactics for big data systems. The design decisions span the data, distribution, and deployment architectures: control resource demand and manage resources through distributed webserver caching to reduce database read load, database partitioning to distribute read load, replicated stateless webservers, a data model denormalized to support a single query per use case, and a database replicated across clusters.]

For example, achieving availability requires masking faults that inevitably occur in a distributed system. At the data level, replicating data items is an essential step to handle network partitions. When an application can't access any database partition, another tactic to enhance availability is to design a data model that can return meaningful default values without accessing the data. At the distributed-software layer, caching is a tactic to achieve the default-values tactic defined in the data model. At the deployment layer, an availability tactic is to geographically replicate the data and distributed application software layers to protect against power and network outages. Each of these tactics is necessary to handle the different types of faults that threaten availability. Their combined representation in Figure 3 provides architects with comprehensive guidance to achieve highly available systems. (A small sketch combining the caching and default-values tactics follows below.)
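As an illustration of combining the caching and default-values tactics (a sketch under assumed names, not the article's code): the application tries the database first, then a local cache, and finally a default drawn from the data model, so a partitioned database still yields a meaningful response.

```python
# Data-model defaults: safe values an application can return when
# neither the database nor the cache can supply real data.
DEFAULTS = {"allergies": "unknown; verify with patient"}

def read_allergies(patient_id, db_lookup, cache):
    try:
        value = db_lookup(patient_id)   # may raise during a partition
        cache[patient_id] = value       # refresh the cache on success
        return value
    except ConnectionError:
        # Serve possibly stale cached data, else a safe default value.
        return cache.get(patient_id, DEFAULTS["allergies"])
```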
[FIGURE 3. Availability tactics for big data systems: detect faults (health monitoring); mask faults (a data model that supports replication to eliminate single points of failure, a model for default data values, handling database failures through caching or default responses, and data replicas residing in geographically distributed datacenters); and recover from faults (handling write inconsistencies and balancing data distribution). The tactics' combined representation provides architects with comprehensive guidance to achieve highly available systems.]

[FIGURE 4. Scalability tactics for big data systems. These tactics focus on the design concern of increased workloads: respond to increased load by automatically or manually increasing capacity (the data distribution mechanism supports distributing data over new nodes, server resources are provisioned on demand as load increases, and the cluster management platform supports dynamic provisioning) and respond to decreased load by automatically releasing capacity (server resources are released as load decreases, and the cluster management platform supports dynamic decommissioning of nodes).]

Big data applications are pushing the limits of software engineering on multiple horizons. Successful solutions span the design of the data, distribution, and deployment architectures. The body of software architecture knowledge must evolve to capture this advanced design knowledge for big data systems. This article is a first step on this path. Our research is proceeding in two complementary directions. First, we're expanding our collection of architecture tactics and encoding them in an environment that supports navigation between quality attributes and tactics, making crosscutting concerns for design choices explicit. Second, we're linking tactics to design solutions based on specific big data technologies, enabling architects to rapidly relate a particular technology's capabilities to a specific set of tactics.

Acknowledgments

This material is based on research funded and supported by the US Department of Defense under contract FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. References herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, don't necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. This material has been approved for public release and unlimited distribution.

References

1. D. Agrawal, S. Das, and A. El Abbadi, "Big Data and Cloud Computing: Current State and Future Opportunities," Proc. 14th Int'l Conf. Extending Database Technology (EDBT/ICDT 11), 2011.
2. W. Vogels, "Amazon DynamoDB: A Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications," blog, 18 Jan. 2012; www.allthingsdistributed.com/2012/01/amazon-dynamodb.html.
3. F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data," ACM Trans. Computer Systems, vol. 26, no. 2, 2008, article 4.
4. P.J. Sadalage and M. Fowler, NoSQL Distilled, Addison-Wesley Professional, 2012.
5. E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," Computer, vol. 45, no. 2, 2012, pp. 23-29.
6. D.J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design: CAP Is Only Part of the Story," Computer, vol. 45, no. 2, 2012, pp. 37-42.
7. J. Shute et al., "F1: A Distributed SQL Database That Scales," Proc. VLDB Endowment, vol. 6, no. 11, 2013, pp. 1068-1079.
8. M. Fowler, "PolyglotPersistence," blog, 16 Nov. 2011; martinfowler.com/bliki/PolyglotPersistence.html.
9. M. Finnegan, "Boeing 787s to Create Half a Terabyte of Data per Flight, Says Virgin Atlantic," Computerworld UK, 6 Mar. 2013.
10. B. Kayyali, D. Knott, and S. Van Kuiken, The Big Data Revolution in US Healthcare: Accelerating Value and Innovation, McKinsey & Co., 2013.
11. X. Amatriain and J. Basilico, "System Architectures for Personalization and Recommendation," blog, 27 Mar. 2013; techblog.netflix.com/2013/03/system-architectures-for.html.
12. M. Armbrust et al., "A View of Cloud Computing," Comm. ACM, vol. 53, no. 4, 2010, pp. 50-58.
13. P.B. Kruchten, "The 4+1 View Model of Architecture," IEEE Software, vol. 12, no. 6, 1995, pp. 42-50.
14. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1992.
15. L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, 3rd ed., Addison-Wesley, 2012.

ABOUT THE AUTHORS

IAN GORTON is a senior member of the technical staff on the Carnegie Mellon Software Engineering Institute's Architecture Practices team, where he investigates issues related to software architecture at scale. This includes designing large-scale data management and analytics systems and understanding the inherent connections and tensions between software, data, and deployment architectures. Gorton received a PhD in computer science from Sheffield Hallam University. He's a senior member of the IEEE Computer Society. Contact him at [email protected].

JOHN KLEIN is a senior member of the technical staff at the Carnegie Mellon Software Engineering Institute, where he does consulting and research on scalable software systems as a member of the Architecture Practices team. Klein received an ME in electrical engineering from Northeastern University. He's the secretary of the International Federation for Information Processing Working Group 2.10 on Software Architecture, a member of the IEEE Computer Society, and a senior member of ACM. Contact him at [email protected].
