Report Data Management in the Cloud: Limitations and Opportunities
Article by Daniel J. Abadi [1]
Report by Lukas Probst
January 4, 2013

In this report I summarize Daniel J. Abadi's article [1] and present some open questions, points of criticism and recent research results. The report is structured as follows: Sections 1-3 contain the content of the article itself plus a few additional comments of mine.¹ Section 4 lists some important questions which the article left open and for which there were no solutions at the time the article was published. Section 5 concentrates on the criticism that an untrusted host is not always a problem for OLTP applications and that, unlike the article concludes, an OLTP application could therefore be deployed in the cloud. Finally, section 6 presents some new research results which were published in the meantime: the NoDB approach [2] and Google's ready-to-use cloud products [3, 4].

1 Introduction

Daniel J. Abadi defines cloud computing as "a general shift of computer processing, storage, and software delivery away from the desktop and local servers, across the network, and into next generation data centers hosted by large infrastructure companies".² For many companies (especially start-ups), the pay-as-you-go computing model that a cloud provides is a perfect match. Therefore, Abadi's article [1] explores whether database management applications can also be deployed in the cloud.

2 Data Management in the Cloud

2.1 Cloud Characteristics

The goal of section 2 is to decide which data management applications can be deployed in the cloud. For this purpose, Abadi first presents the three most important characteristics of a cloud computing environment.

"Compute power is elastic, but only if workload is parallelizable"

As already mentioned in the introduction, the major benefit of cloud computing is its elasticity.
With the pay-as-you-go model, a company can avoid both under-utilization of existing capacity (see figure 1(a)) and lost revenue due to insufficient capacity (see figure 1(b)) by always allocating exactly as much capacity as needed (see figure 1(c)). But this behavior is only achievable if the workload is parallelizable, because allocating more capacity does not mean getting a better server but getting more nodes (e.g., Amazon EC2 instances). If the workload cannot be distributed among the nodes, new nodes will not be beneficial. For example, as figure 2 shows, parallel reads are easy to implement, while parallel writes are a hard task. In general, shared-nothing architectures are the easiest to parallelize.

¹ Please note that, apart from some restructuring, sections 1-3 are mainly a repetition of Daniel J. Abadi's work [1].
² Daniel J. Abadi's definition of cloud computing [1]
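The difference between static capacity and the pay-as-you-go model can be made concrete with a small calculation. The following is only a minimal sketch: the function names and the hourly demand values are hypothetical and chosen purely for illustration, and capacity is counted in whole nodes.

```python
from typing import List, Tuple


def static_capacity_waste(demand: List[int], capacity: int) -> Tuple[int, int]:
    """Return (unused node-hours, unserved node-hours) for a fixed capacity."""
    unused = sum(max(capacity - d, 0) for d in demand)     # under-utilization
    unserved = sum(max(d - capacity, 0) for d in demand)   # lost revenue
    return unused, unserved


def elastic_node_hours(demand: List[int]) -> int:
    """With pay-as-you-go, the allocated nodes follow the demand exactly."""
    return sum(demand)


# Hypothetical demand in nodes per hour (illustrative numbers only).
demand = [2, 3, 8, 12, 5, 2]

print(static_capacity_waste(demand, capacity=12))  # (40, 0): sized for the peak
print(static_capacity_waste(demand, capacity=5))   # (8, 10): sized too small
print(elastic_node_hours(demand))                  # 32 node-hours actually needed
```

The elastic case only pays for the 32 node-hours that are needed, but it assumes that the workload can really be spread over 12 nodes in the peak hour, which is exactly the parallelizability condition named above.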
Figure 1: Static capacity vs. the pay-as-you-go cloud computing model.³ (a) Under-utilization, (b) Lost revenue, (c) Pay-as-you-go cloud computing model

Figure 2: Illustrations of parallel reads and writes. (a) Parallel read, (b) Parallel write

"Data is stored at an untrusted host"

If a company stores sensitive data in the cloud, it cannot exclude the possibility that the host company accesses the data without permission and, for example, steals or sells sensitive data (e.g., credit card numbers), even if this scenario sounds very unlikely. Moreover, since the data has to be stored physically in some country, it is governed by the laws of that country. For example, Abadi mentions in his article [1] that "the USA PATRIOT Act allows the US government to demand access to the data stored on any computer". For some companies these two points can be a problem.

"Data is replicated, often across large geographic distances"

Since cloud providers often own data centers all over the world, they can provide the highest possible degree of fault tolerance by automatically replicating the data across large geographic distances. For example, data is stored very safely with respect to availability and durability if it is stored at the same time in the USA, Europe and Australia.

2.2 Data Management Applications in the Cloud: OLTP vs. OLAP

After defining the most important cloud characteristics, the article checks whether the two kinds of data management applications, transactional data management (OLTP) and analytical data management (OLAP), can be deployed in the cloud.

Transactional Data Management (OLTP)

Typical applications using transactional data management (OLTP) need ACID guarantees and include many write operations.
Since the requested data is typically distributed over several sites, and transactions therefore cannot be restricted to accessing data on a single site, implementing transactional data management systems with a shared-nothing architecture would necessitate "complex distributed locking and commit protocols".

³ Source: Uni Basel, Department of Mathematics and Computer Science, cs341 Distributed Information Systems (Fall Semester 2012) lecture slides, Chapter 7: Cloud Computing & NoSQL, slides 28-30, fileadmin/lectures/hs2012/cs341/slides/07-cs341-hs12-cloud_computing-nosql.pdf
For this reason, none of the four big players (Oracle, IBM DB2, Microsoft SQL Server and Sybase) has a shared-nothing transactional database. Furthermore, it is hard to maintain ACID guarantees in the cloud. The CAP theorem shows that one can only choose two out of three properties: consistency, availability and tolerance to partitions. Because partitions cannot be excluded, one always needs tolerance to partitions. Hence, one typically decides to sacrifice consistency (the C in ACID) to gain good availability. Moreover, OLTP databases typically contain all the data, i.e., also sensitive information such as credit card numbers. Hence, Abadi argues that storing transactional data on an untrusted host is an enormous risk which is typically unacceptable, and that transactional data therefore cannot be stored at an untrusted host. Due to these observations, Abadi concludes that OLTP applications are not well-suited for cloud deployment.

Analytical Data Management (OLAP)

Abadi argues in his article [1] that the shared-nothing architecture is a good match for OLAP systems: it scales best, and scalability is very important for OLAP systems because of the huge amount of data. Furthermore, the fact that data analysis workloads tend to be read-only with only "infrequent writes" leads to two additional advantages: first, data analysis workloads are easy to parallelize across nodes (see section 2.1.3), and second, there is no need for "complex distributed locking and commit protocols". Moreover, since small inconsistencies are not problematic for analytical queries (e.g., computing the average customer age), the consistency trade-off (CAP theorem) is no problem for OLAP applications. Finally, there are multiple ways to handle sensitive data for analysis on an untrusted host. Abadi proposes in his article [1] that the sensitive data can be left out, anonymized or encrypted. Furthermore, he suggests the possibility of storing only aggregated data (e.g.,
averages, sums, ...). Thus, untrusted hosts can be used for storing analytical data. Because of these facts, Abadi concludes that OLAP applications, in contrast to OLTP applications, are well-suited for cloud deployment.

3 Data Analysis in the Cloud

The rest of the article concentrates on how to perform data analysis (OLAP) in the cloud. Abadi focuses in his article [1] on two classes of software solutions: "MapReduce-like software" and "commercially available shared-nothing parallel databases".

3.1 Cloud DBMS Wish List

Before taking a closer look at MapReduce and shared-nothing parallel DBs, the article [1] lists some properties that a good solution should provide:

1. Efficiency: If one only pays for what one uses, the price increases linearly with the resources used. Hence, one wants to use the most efficient OLAP software solution, because more efficient software is cheaper to use.

2. Fault Tolerance: For read-only (OLAP) queries, fault tolerance means that a query does not have to be restarted if a single node involved in the query fails. The problem is that in a cloud, where the many nodes involved (e.g., Amazon EC2 instances) have a high failure rate (consumer-grade hardware), the probability that some single node fails during a long query is very high. Thus, the system must be able to handle single failures without restarting the whole query.

3. Ability to run in a heterogeneous environment: Because of occasional hardware failures (e.g., a failing core), cloud computing nodes are unfortunately not as homogeneous as they should be. If the work is distributed equally among all nodes, the time to complete the query will be equal to the time the slowest node needs to complete its task. A system should therefore be able to handle heterogeneous environments.

4.
Ability to interface with business intelligence products: Since business analysts are typically not computer scientists, there are many so-called "business intelligence products" which help them to generate queries and visualize results. If the database software wants to support these tools, it has to accept SQL queries over ODBC or JDBC connections.
5. Ability to operate on encrypted data: As already mentioned, one way to solve the untrusted host problem is to store only encrypted data. Abadi [1] argues that the system should be able to operate directly on encrypted data, because giving the cloud application the ability to decrypt the data would destroy the protection, and transferring the data elsewhere for decryption would be too bandwidth intensive.

Figure 3: MapReduce's ability to handle faults and slow nodes. (a) Fault tolerance, (b) Heterogeneous environment

3.2 MapReduce vs. Shared-Nothing Parallel DBs

After presenting the desired properties, Abadi checks in his article [1] how well the two available solutions, "MapReduce-like software" and "commercially available shared-nothing parallel databases", satisfy these properties. Although some people say that comparing MapReduce to database systems is an "apples-to-oranges" comparison, I agree with Abadi's position that it is warranted, because in my opinion it is justifiable to compare how two approaches solve the same problem.

Efficiency

Analytical queries perform much slower in MapReduce than in alternative systems like shared-nothing parallel DBs. Abadi argues in his article [1] that the reason is that MapReduce was designed for working on unstructured data, for which its "brute force scan strategy" is a good idea (e.g., creating web indexes). But in analytical data stores, where the data is structured, shared-nothing parallel databases with their typical helper structures like indexes or dimensions outperform MapReduce. Some people say that it is a feature that MapReduce does not have such helper structures, because they take time to create when data is loaded, but usually the long-term benefit outweighs these creation costs.
I do not agree with Abadi's opinion that MapReduce's performance is a matter of debate, because what we wanted to check is only how MapReduce performs for analytical queries on large data stores, and for this application MapReduce is very inefficient.

Fault Tolerance

While MapReduce is designed to be fault tolerant, most parallel database systems are not. MapReduce can handle a single node failure by simply reassigning the data split (task) to a new worker node (see figure 3(a)). In contrast, shared-nothing parallel DBs are designed to run on special hardware, where failures are uncommon. Consequently, they are not fault tolerant and restart a query if a single node fails.

Ability to run in a heterogeneous environment

MapReduce can also handle heterogeneous environments with some slow nodes using nearly the same mechanism: it simply reassigns the split assigned to the slow worker to a second worker node once most worker nodes have already finished their tasks (see figure 3(b)). Conversely, shared-nothing parallel DBs cannot handle heterogeneous nodes because, as already mentioned, they are designed to run on special hardware. As a result, a single slow node can have a huge impact on the total query execution time.
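The reassignment mechanism described in the last two subsections can be sketched in a few lines. This is a strongly simplified toy model and not the scheduler of any real MapReduce implementation: the function `run_job`, the worker states "ok", "failed" and "slow", and the round-robin assignment are all assumptions made for illustration.

```python
from typing import Dict, List


def run_job(splits: List[str], workers: Dict[str, str]) -> Dict[str, str]:
    """Assign each split to a worker; reassign splits of failed or slow workers.

    `workers` maps a worker name to its state: "ok", "failed" or "slow".
    Returns a mapping from split to the worker that finally delivered it.
    """
    healthy = [w for w, state in workers.items() if state == "ok"]
    if not healthy:
        raise RuntimeError("no healthy worker left")

    names = list(workers)
    done: Dict[str, str] = {}
    pending: List[str] = []

    # First round: round-robin assignment over all workers.
    for i, split in enumerate(splits):
        worker = names[i % len(names)]
        if workers[worker] == "ok":
            done[split] = worker
        else:
            pending.append(split)  # lost on a failed node or stuck on a slow one

    # Second round: only the affected splits are handed to healthy workers;
    # the splits that already finished are not recomputed.
    for i, split in enumerate(pending):
        done[split] = healthy[i % len(healthy)]
    return done


# Hypothetical cluster with one failed and one slow node.
workers = {"w1": "ok", "w2": "failed", "w3": "slow", "w4": "ok"}
result = run_job(["s0", "s1", "s2", "s3", "s4", "s5"], workers)
print(result)
```

In this toy run the splits s1, s2 and s5 are moved to the healthy workers w1 and w4, while s0, s3 and s4 keep their first result. A parallel DB in the sense of the text would instead restart the whole query after the failure of w2 and wait for w3 in the slow-node case.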
Property                           MapReduce   Shared-nothing parallel DB
1. Efficiency                      no          yes
2. Fault Tolerance                 yes         no
3. Heterogeneous environment       yes         no
4. Business intelligence products  no          yes
5. Encrypted data                  no          no

Table 1: Overview of which properties are fulfilled by the two software solutions

Ability to interface with business intelligence products

While in shared-nothing parallel DBs the ability to interface with business intelligence products comes for free, MapReduce is not SQL compatible, and therefore it is not easy to use existing business intelligence products with MapReduce systems.

Ability to operate on encrypted data

Neither of the two software solutions has a native ability to operate directly on encrypted data. In MapReduce the only possibility is to provide user-defined code. Similarly, if operations more advanced than moving or copying encrypted data are to be performed in shared-nothing parallel DBs, user-defined functions are required.

3.3 Conclusion: A call for a hybrid solution

As one can easily see in table 1, neither MapReduce nor shared-nothing parallel DBs can fulfill all properties. But except for the ability to operate on encrypted data, each property is fulfilled by one of the two solutions. Hence, Abadi proposes a hybrid as the ideal solution. There is already some recent work, which Abadi presents in his article [1], but regrettably this work focuses only on language and interface issues: using SQL in MapReduce and using MapReduce functions in parallel databases. Finally, Abadi presents in his article [1] two research questions and his ideas for solving them. The first question is how to combine MapReduce's ability to work directly with the data with the performance gain from helper data structures. His idea is an incremental algorithm which makes progress in creating helper data structures each time the data is accessed. The second problem is that fault tolerance requires saving intermediate results, and this costs performance.
So the question is how to balance fault tolerance and efficiency. Abadi's idea for solving this problem is to build a system which autonomously adjusts its level of fault tolerance based on the observed failure rate.

4 Open Questions

Daniel J. Abadi's article [1] concentrates on two questions: What can we do in the cloud? What solutions do we want for that? Although the article discusses and answers these two questions in great detail, there are still some open questions which have to be answered before one can deploy OLAP applications in the cloud. For example: How can we use the cloud today for data warehousing? Are there any useful products we can use today? How can we implement the hybrid solution? In this section, I take a second look at the three proposed software solutions (MapReduce, shared-nothing parallel DBs and the hybrid solution) and present some still unsolved problems.
Figure 4: MapReduce in the cloud

4.1 MapReduce

Let us assume that we decided to run MapReduce-like software in the cloud to support the OLAP applications. Figure 4, which illustrates this scenario, shows that in this case we are faced with two questions. First, we need many worker nodes to compute the map and reduce steps and another node to collect the results. The question is what kind of server instances (or other cloud products) should be used as nodes to obtain the best performance. My suggestion would be to use an Amazon EC2 instance for each node, but the article presents no evaluation which could show whether this would be a good or a bad decision. And even if this first problem is solved, there is still the problem of where to store the data in the cloud. There are reams of different cloud products for storing data (e.g., Amazon S3), but the article does not provide any recommendation as to which one to use.

4.2 Shared-nothing parallel databases

In the second example scenario (illustrated in figure 5), we assume that our company currently owns many data warehouses and now wants to use only one giant shared-nothing parallel data warehouse in a cloud. So the first question is whether there is any existing shared-nothing parallel data warehouse product in any cloud that we can use. If this is not the case, we cannot solve the task unless our company wants to implement the product on its own. But even if we assume that there is such a product, we still have to solve the problem of how to integrate the data from the local data warehouses into the new cloud product. In addition to the typical schema integration problems, we are faced with another problem: since data warehouses typically store several petabytes of data, transferring the data from the local data warehouses to the cloud is a big problem.
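To get a feeling for the scale of this transfer problem, a rough back-of-the-envelope calculation helps. The sketch below assumes a fully utilized link without protocol overhead, which is optimistic; the bandwidth values of 100 Mbit/s and 1 Gbit/s are my own illustrative assumptions and do not come from the article.

```python
def transfer_days(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Days needed to move `size_bytes` over a link with the given bandwidth."""
    seconds = size_bytes * 8 / bandwidth_bits_per_s
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte in bytes

# Assumed link speeds (illustrative only).
print(round(transfer_days(1 * PETABYTE, 100 * 10**6)))  # 926 days at 100 Mbit/s
print(round(transfer_days(1 * PETABYTE, 1 * 10**9)))    # 93 days at 1 Gbit/s
```

Even under these idealized assumptions a single petabyte needs months to years, and the scenario above speaks of several petabytes.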
Because with a common internet connection the integration task would take too long, there has to be another solution, which the article does not provide.

Figure 5: Shared-nothing parallel databases in the cloud and integration of existing data warehouses

4.3 Hybrid solution

As a conclusion of his article [1], Daniel J. Abadi proposes a hybrid as the perfect solution for data analysis in the cloud. Furthermore, he presents some ideas for solving the remaining research questions. Although the idea of having an incremental algorithm and an autonomous self-adjusting system sounds quite nice, Abadi does not mention whether any sophisticated concepts have been implemented or at least presented yet.

5 Critique: "Untrusted hosts" are usable for OLTP

Daniel J. Abadi argues in his article [1] that it is an enormous risk to store transactional data on an untrusted host, because OLTP data includes sensitive data. In the discussion after my workshop presentation we came to the conclusion that this is not entirely true. There are two possibilities for unauthorized data access in the cloud listed in the article. The first is that the cloud provider itself steals or sells the data. As Abadi already mentioned in his article [1], this is very unlikely, because in this case the cloud provider would eventually lose all its business customers. As a second risk, Abadi argues that the USA PATRIOT Act gives the US government the right to access data on all computers located in the US and therefore also the right to access data stored in a cloud provider's data center in the US. In our discussion we concluded that, first, it is not the intention of the US government to spy on companies and, second, Abadi's point is simply not true. The USA PATRIOT Act only says that internet providers have to disclose their data, i.e., the US government has the possibility to monitor and access data while the data is traveling through the internet. Hence, if the sensitive data (e.g., credit card numbers) is only stored in the cloud and not transferred through the internet, no company has to fear that its sensitive data will be accessed without its knowledge. Since, in addition, writing a shared-nothing OLTP system is only hard but not impossible, and other workshop topics presented solutions for the ACID problem, in my opinion it should be possible to deploy OLTP applications in the cloud.

6 Latest research results

Since Daniel J. Abadi's article [1] was published in 2009, some new research results have been presented in the meantime. In this section I present the NoDB approach [2] as well as Google's ready-to-use cloud solutions [3, 4].
6.1 NoDB

The authors of the NoDB article [2] argue that the major problem for applications which have to handle giant amounts of data (e.g., social networks) is that in state-of-the-art OLAP database systems this data has to be fully loaded and initialized before any of it can be accessed (see also section 3.2.1). To handle this, the NoDB approach [2] introduces the idea of "adaptive data loads", which is very similar to the incremental algorithm idea by Daniel J. Abadi [1]. In the article [2] the authors furthermore show that their NoDB implementation (PostgresRaw) can compete with traditional DBMSs like PostgreSQL, i.e., its TPC-H performance was equivalent or better.

6.2 Google's Cloud Solutions

Google presented ready-to-use solutions for both OLTP and OLAP applications in its own cloud. Google BigQuery [3] is Google's OLAP solution. In Google BigQuery a customer can run analytical select queries in a short time but cannot run any update or delete queries. If a customer wants to have an OLTP database in the cloud, Google Cloud SQL [4] can be used instead.⁴

⁴ This paragraph is only a short summary. For a more detailed comparison take a look at com/bigquery/docs/overview
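The common idea behind the adaptive data loads of section 6.1 and Abadi's incremental algorithm, namely to query the raw data directly and to build helper structures only as a side effect of queries, can be illustrated with a toy example. This sketch is not the PostgresRaw implementation; the class `RawTable`, its methods and the sample data are hypothetical, and the "helper structure" is just a per-column cache of parsed values.

```python
from typing import Dict, List


class RawTable:
    """Queries raw text lines directly and caches columns as they are touched."""

    def __init__(self, lines: List[str], header: List[str]) -> None:
        self.lines = lines
        self.header = header
        self.cache: Dict[str, List[float]] = {}  # helper structure, built lazily
        self.parsed_columns = 0                   # how often raw data was parsed

    def column(self, name: str) -> List[float]:
        # Parse the raw lines only on the first access to this column.
        if name not in self.cache:
            idx = self.header.index(name)
            self.cache[name] = [float(line.split(",")[idx]) for line in self.lines]
            self.parsed_columns += 1
        return self.cache[name]

    def average(self, name: str) -> float:
        values = self.column(name)
        return sum(values) / len(values)


# Hypothetical raw data: id, age, revenue.
table = RawTable(["1,34,120.5", "2,41,80.0", "3,27,99.5"], ["id", "age", "revenue"])
print(table.average("age"))  # first access parses the raw lines
print(table.average("age"))  # second access is served from the cache
```

No load phase is needed before the first query, and columns that are never queried (here "revenue") are never parsed. A real system would of course use far more refined structures than a plain in-memory list.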
References

[1] D. J. Abadi, Data Management in the Cloud: Limitations and Opportunities, IEEE Data(base) Engineering Bulletin, vol. 32, pp. 3-12, 2009.
[2] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, NoDB: efficient query execution on raw data files, pp ,
[3] Google BigQuery. December
[4] Google Cloud SQL. December
Step by Step: Big Data Technology Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015 Data Sources IT Infrastructure Analytics 2 B y 2015, 20% of Global 1000 organizations
More informationData Management in the Cloud
Data Management in the Cloud Ryan Stern stern@cs.colostate.edu : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
More informationBig Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料
Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料 美 國 13 歲 學 生 用 Big Data 找 出 霸 淩 熱 點 Puri 架 設 網 站 Bullyvention, 藉 由 分 析 Twitter 上 找 出 提 到 跟 霸 凌 相 關 的 詞, 搭 配 地 理 位 置
More informationwww.ijreat.org Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (www.prdg.org) 28
Data Warehousing - Essential Element To Support Decision- Making Process In Industries Ashima Bhasin 1, Mr Manoj Kumar 2 1 Computer Science Engineering Department, 2 Associate Professor, CSE Abstract SGT