Introduction to Multi-Data Center Operations with Apache Cassandra, Hadoop, and Solr WHITE PAPER
By DataStax Corporation, August 2012
Contents

Introduction
The Growth in Multiple Data Centers
Why Multi-Data Center Datastores?
A Brief Multi-Data Center Database Checklist
A Look at Apache Cassandra
Cassandra and Multiple Data Centers
Multi-Data Center Performance
Running Apache Hadoop and Solr Across Multiple Data Centers
Options for Multi-Data Center Hadoop and Solr
A Look at DataStax Enterprise
Multi-Data Center Support for Hadoop
Multi-Data Center Support for Solr
What About the Cloud?
Managing and Monitoring Multi-Data Center Deployments
Multi-Data Center Customer Examples
Conclusion
About DataStax

© 2012 DataStax. All rights reserved.
Introduction

Many modern businesses have external-facing database applications that are growing dramatically and that serve a geographically dispersed customer base. Numerous companies also have highly distributed workforces, with each employee needing fast access to the same corporate information no matter where they are located. A database that easily spans multiple data centers and/or the cloud keeps read and write response times low for customers and employees who are geographically separated. A multi-data center database also provides other benefits, such as protecting information from loss if a single data center experiences a disaster.

This paper discusses why multi-data center databases are fast becoming the norm for database operations, and what characteristics a database must possess to run across many data centers and the cloud at once. It then turns to how Apache Cassandra, Hadoop, and Solr can be configured to run across multiple data centers and cloud providers to meet the requirements of those needing a smart and agile datastore that is truly location independent.

The Growth in Multiple Data Centers

A 2012 article in InfoWorld reported interesting statistics about the growth of multiple data centers. In its latest poll of data center managers, the Uptime Institute found that 80 percent of respondents had built a new data center or upgraded an existing facility within the past five years. [1] The same article cited another study of the North American data center market by Digital Realty Trust. In that study, 92 percent of respondents said their companies would definitely or probably expand their data center space in 2012, the highest percentage reported in six years.
This news, coupled with the fact that data centers are primarily put in place to hold corporate data, makes it plain that the need for databases that can easily span and interact between multiple data centers is only going to escalate, and likely at a rapid clip.

[1] "Large enterprises handing off data center builds as demand booms," by Ann Bednarz, InfoWorld, April 23, 2012.
Why Multi-Data Center Datastores?

The reasons a multi-data center datastore is needed vary. Some use cases involve just the simple desire for a good disaster recovery plan. But the majority of multi-data center use cases revolve around keeping one logical database synchronized across 1-N physical data centers and delivering the fastest possible response times to the users that each data center serves.

One other factor contributing to the multi-data center discussion is big data. Those familiar with the term can normally recite the three V's that make up big data: velocity, volume, and variety. However, one overlooked aspect of big data systems is complexity, which, according to Gartner Inc., involves managing data across many different data centers, time zones, geographies, and so forth. [2]

Distributing data across many different data centers and the cloud is not an easy task with traditional databases. When the data also arrives at extremely high rates from many places, comes in varying formats, and involves heavy volumes, the job becomes even harder.

A Brief Multi-Data Center Database Checklist

Even outside of big data environments, legacy relational databases (RDBMSs), the primary datastores for most businesses, have traditionally provided minimal support for multiple data centers. Other than basic replication or one-way mirroring, RDBMS products typically lack key built-in features needed by modern applications that require a datastore spanning many different data centers and/or cloud geographies.

This raises the question: what features and capabilities does a modern database/datastore need to meet the demands of multi-data center operations? Does it just equate to log shipping, mirroring between data centers, or master-slave replication, or is it something else?
Increasingly, the must-have short list from those wanting modern multi-data center capabilities includes the following:

- The ability to span 1-N data centers, and not just two. This includes the agility to handle multiple cloud geo-zones as well.
- Multidirectional syncs between all participating data centers, and not just one way. In other words, truly location-independent, read-and-write-anywhere freedom.
- Built-in network intelligence, so that data is smartly transferred between data centers to minimize bandwidth overload and latency issues.
- The ability to support the required type of data traffic across data centers (e.g. real-time, analytic, search).
- Capabilities for handling big data use cases in a way where all data centers appear as just one logical database to an end user application.

[2] "Big Data Is Only the Beginning of Extreme Information Management," by Beyer, et al., Gartner Group Inc., April 7, 2011.

Pulling this off is not easy unless one starts with the right database architecture and feature set. The traditional master-slave designs inherent in RDBMSs and some NoSQL solutions are often impractical here, because they cannot meet the requirement for true location independence. Fortunately, Apache Cassandra possesses the right blend of technical features and big data capabilities to handle modern multi-data center and cloud deployments.

A Look at Apache Cassandra

Apache Cassandra is a massively scalable NoSQL database. Cassandra's technical roots can be found at companies recognized for their ability to effectively tackle big data: Google, Amazon, and Facebook. Used today by numerous modern businesses to manage their critical data infrastructure, Cassandra is known as the solution technical professionals turn to when they need a real-time NoSQL database that supplies high performance at massive scale with continuous availability.

Rather than using a legacy master-slave design or a manual, difficult-to-maintain sharded design, Cassandra has a peer-to-peer distributed ring architecture that is more elegant and easier to set up and maintain. In Cassandra, all nodes are the same; there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.

Cassandra's built-for-scale architecture means that it is capable of handling petabytes of information and thousands of concurrent users/operations per second across one to many data centers as easily as it can manage much smaller amounts of data and user traffic.
It also means that, unlike master-slave or sharded systems, Cassandra has no single point of failure and is therefore capable of offering true continuous availability.
Cassandra and Multiple Data Centers

Cassandra's architecture is tailor-made for multiple data centers. Its peer-to-peer design (vs. legacy master-slave implementations), coupled with online scale-out and full redundancy with no single point of failure, makes it ideal in multi-data center environments. Because Cassandra is a masterless architecture, all nodes are the same and all nodes offer full read/write capabilities in a database cluster, regardless of where those nodes are physically located.

A single Cassandra ring (or database cluster) can certainly exist at just one physical data center. However, Cassandra can easily support a single database spanning multiple data centers, where each data center holds its own copy of the database and can have as many nodes as needed to support that site:

Figure 1: A Single Cassandra Database with Multiple Data Centers

Creating a database that spans multiple data centers in Cassandra is easy and is done as part of defining the database. Once the database software has been installed and is running on all machines in all participating data centers, and network communication has been established among all the nodes, a keyspace (analogous to an RDBMS database) is created using Cassandra's CQL language. Within the definition of a keyspace, each data center is identified (with the ID matching configuration parameters that have been previously set) along with the number of copies of the data that the keyspace will hold in that data center.

For example, the syntax below creates a new keyspace named Globalbiz with three data centers (DC1, DC2, and DC3): the first and second each hold six copies of the data (for fault tolerance purposes) and the third holds three copies:

    CREATE KEYSPACE Globalbiz
      WITH strategy_class = 'NetworkTopologyStrategy'
      AND strategy_options:DC1 = 6
      AND strategy_options:DC2 = 6
      AND strategy_options:DC3 = 3;

Once this command executes successfully, all data is automatically and transparently replicated between the nodes in all data centers, with no further work necessary on the part of any developer or administrator.

Multi-Data Center Performance

One reason for multi-data center deployment is to keep copies of a database close to the users of a particular data center/geographic region, with the end result being faster performance for both reads and writes. But what about performance across data centers? Won't updating many nodes in many different data centers put too heavy a load on a database cluster?

To address this concern, Cassandra has built-in intelligence to send only a single data stream from one data center to each of the others participating in a multi-data center cluster. Once the data has reached one of the nodes in a different data center, that node takes responsibility for updating the other nodes in its data center that are responsible for holding that piece of data.

Figure 2: Cross-Data Center Writes in Cassandra
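The cross-data center write path described above can be illustrated with a small simulation. The following is a minimal sketch in Python, not Cassandra's actual implementation: the names `Message`, `replicas_for`, `plan_write`, and `naive_cross_dc_count`, as well as the node names, are hypothetical and exist only for illustration. It assumes the replica counts from the Globalbiz keyspace example and a write coordinated by a node in DC1.

```python
from __future__ import annotations

from dataclasses import dataclass

# Replica counts per data center, mirroring the Globalbiz keyspace example.
REPLICATION = {"DC1": 6, "DC2": 6, "DC3": 3}


@dataclass(frozen=True)
class Message:
    source: str     # node that sends the write
    target: str     # node that receives the write
    cross_dc: bool  # True when the message crosses a data center boundary


def replicas_for(dc: str, count: int) -> list[str]:
    """Return hypothetical replica node names for one data center."""
    return [f"{dc}/node{i}" for i in range(1, count + 1)]


def plan_write(coordinator_dc: str, replication: dict[str, int]) -> list[Message]:
    """Plan the messages needed to deliver one write to every replica."""
    coordinator = f"{coordinator_dc}/coordinator"
    messages: list[Message] = []
    for dc, count in replication.items():
        replicas = replicas_for(dc, count)
        if dc == coordinator_dc:
            # Replicas in the coordinator's own data center are written directly.
            messages += [Message(coordinator, r, False) for r in replicas]
        else:
            # Only one message crosses over to each remote data center...
            forwarder, *rest = replicas
            messages.append(Message(coordinator, forwarder, True))
            # ...and the receiving node relays it to the other local replicas.
            messages += [Message(forwarder, r, False) for r in rest]
    return messages


def naive_cross_dc_count(coordinator_dc: str, replication: dict[str, int]) -> int:
    """Cross-DC messages if every remote replica were contacted directly."""
    return sum(n for dc, n in replication.items() if dc != coordinator_dc)


if __name__ == "__main__":
    plan = plan_write("DC1", REPLICATION)
    forwarded = sum(m.cross_dc for m in plan)
    print(f"total messages: {len(plan)}")
    print(f"cross-DC messages with forwarding: {forwarded}")
    print(f"cross-DC messages without forwarding: {naive_cross_dc_count('DC1', REPLICATION)}")
```

Under these assumptions, the plan contains 15 messages, of which only 2 cross a data center boundary; contacting every remote replica directly would instead require 9 cross-data center messages.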
Running Apache Hadoop and Solr Across Multiple Data Centers

In addition to managing real-time data across multiple data centers, many modern businesses also wish to run analytic and enterprise search operations that span more than one data center. The most popular open source options today are Apache Hadoop for analytic work and Apache Solr for enterprise search. As with real-time data, implementing cross-data center operations for analytics and search data has proven to be no easy task.

Options for Multi-Data Center Hadoop and Solr

The need for multi-data center support for analytics and enterprise search has not been lost on those developing and supporting Hadoop and Solr. Today, Apache Hadoop offers a warm standby option that can be configured to go to a different data center. Third-party Hadoop vendors also offer solutions with one-way mirror capabilities.

For Solr, writes to Solr indexes in the community version cannot span multiple data centers. Instead, there is only replication support to another node in a different data center via rsync.

Both the open source versions of Hadoop and Solr and those offered by third-party software vendors miss the mark where the criteria for operating a datastore in a multi-data center environment are concerned. However, DataStax Enterprise, offered by DataStax, not only supplies multi-data center support for real-time data that meets the criteria suggested earlier in this paper, but also delivers the same enterprise support for multi-data center Hadoop and Solr.
A Look at DataStax Enterprise

DataStax is the most trusted provider of Cassandra, employing the Apache chair of the Cassandra project as well as most of the committers. For enterprises that want to use Cassandra in production, DataStax supplies DataStax Enterprise Edition, which includes an enterprise-ready version of Cassandra plus integration with Hadoop and Solr. With DataStax Enterprise, modern businesses get a complete big data platform that contains:

- A certified version of Cassandra that has passed DataStax's rigorous internal certification process, which includes heavy quality assurance testing, performance benchmarking, and more.
- An integrated Apache Hadoop distribution for analytic operations that includes MapReduce, Hive, Pig, Mahout, and Sqoop support.
- Bundled enterprise search support with Apache Solr.
- An enterprise version of DataStax OpsCenter, a visual management and monitoring tool.
- Expert, 24x7x365 production support.
- Certified maintenance releases.

Multi-Data Center Support for Hadoop

Because DataStax Enterprise is built on Apache Cassandra, it inherits all of Cassandra's strengths where multi-data center support is concerned. In addition to multi-data center capabilities for real-time data management with Cassandra, users of DataStax Enterprise also get full cross-data center support for Hadoop and Solr.

Built into DataStax Enterprise is an enhanced Hadoop distribution that utilizes Cassandra for many of its core services. DataStax Enterprise provides integrated Hadoop MapReduce, Hive, Pig, Mahout, and Sqoop, replacing the Hadoop Distributed File System (HDFS) storage layer with Cassandra (the Cassandra File System, or CFS). The end product is a single integrated solution that provides increased reliability, simpler deployment, and lower total cost of ownership (TCO) than a traditional Hadoop solution. DataStax Enterprise is also fully compatible with existing HDFS and all Hadoop tools and utilities.
Another benefit of using Hadoop in DataStax Enterprise is that it eliminates the complexity and single points of failure of the typical HDFS layer. From an operational standpoint, there is no need to set up a Hadoop name node, secondary name node, Zookeeper, and so on.
From a multi-data center perspective, DataStax Enterprise also provides the ability to have a single Hadoop cluster run across as many data centers as desired. Data added to any Hadoop node in any data center is ready for use at all other sites. Plus, multiple CFSs and Hadoop job trackers can be configured so that each data center has its own local data and job trackers to work with, which increases performance.

Multi-Data Center Support for Solr

DataStax Enterprise includes strong enterprise search support via Lucene and Apache Solr. By integrating Solr into the DataStax Enterprise big data platform, DataStax extends Solr's capabilities and delivers the following:

- An easily scalable search platform
- 100 percent data durability
- No single point of failure
- No write bottleneck
- Automatic data sharding
- Multi-data center capabilities
- Easy, ad-hoc index rebuilds
- The ability to query search data with Cassandra's CQL

In the same way that DataStax Enterprise takes Hadoop and delivers a continuously available, dynamically scalable, and multi-data center-capable Hadoop/analytics system, it automatically does the same thing for Solr and enterprise search operations. Using Cassandra as the underlying foundation, DataStax Enterprise allows search data to be written to any participating data center.

Those currently using Solr will be right at home with DataStax Enterprise. The solution is 100 percent Solr compatible, with all Solr utilities, APIs, and so on, included.

What About the Cloud?

Both Cassandra and DataStax Enterprise are fully cloud-enabled and capable of supporting multiple geo-zone sites in a cloud provider. Further, hybrid deployments are supported, so that a single cluster can span multiple on-premise installations as well as cloud-based implementations.
Figure 4: Cassandra supports hybrid on-premise/cloud deployments

Managing and Monitoring Multi-Data Center Deployments

Administering and monitoring the performance of any distributed database system can be challenging, especially when the database spans multiple geographical locations. However, DataStax makes it easy to manage multi-data center databases with DataStax OpsCenter.

DataStax OpsCenter is a visual management and monitoring solution for Cassandra and other big data technologies such as Apache Hadoop and Solr. Because DataStax OpsCenter is web-based, developers or administrators can easily manage and monitor all aspects of their databases from any desktop, laptop, or tablet without installing any client software. This includes databases that span multiple data centers and the cloud.

Figure 5: Managing a 9-node Cassandra cluster with DataStax OpsCenter
Multi-Data Center Customer Examples

Many modern businesses and organizations are using Cassandra for critical applications today. Here are just some examples:

Figure 6: A sample of companies and organizations using Cassandra in production

Some DataStax customers using Cassandra and DataStax Enterprise across multiple data centers and the cloud include:

- Netflix has over 500 nodes of Cassandra running in multiple clusters and geo-zones on Amazon.
- eBay has over 200 TB in DataStax Enterprise across three data centers.
- HealthX supports its online patient and provider portal with DataStax Enterprise running in multiple geographies on Amazon.
- ReachLocal uses DataStax Enterprise in six different data centers across the world to support its global online advertising business.
- Williams-Sonoma runs Cassandra across multiple sites to support its retail website operations.
- Pantheon Systems uses Cassandra across multiple data centers to deliver its cloud-based web development platform.
- Scandit runs Cassandra across three different data centers to support its mobile barcode and product scanning service.
Conclusion

Today's successful businesses are looking for a modern database management system that can easily span multiple data centers and handle real-time, analytic, and enterprise search operations. Cassandra and DataStax Enterprise meet the requirements these businesses have for multi-data center and cloud support.

To find out more about Cassandra and DataStax, and to obtain downloads of Cassandra and DataStax Enterprise software, please visit the DataStax website or send an email to info@datastax.com. Note that DataStax Enterprise Edition is completely free to use in development environments, while production deployments require the purchase of a software subscription.

About DataStax

DataStax, the commercial leader in Apache Cassandra, offers products and services that make it easy for customers to build, deploy, and operate big data applications. Over 190 customers use DataStax today, including leaders such as Netflix, Cisco, Rackspace, and Constant Contact, with industries served including web, financial services, telecommunications, logistics, and government. DataStax is backed by industry-leading investors, including Lightspeed Venture and Crosslink, and is based in San Mateo, CA, with offices also in Austin, TX.
WHITE PAPER Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale Sponsored by: IBM Carl W. Olofson December 2014 IN THIS WHITE PAPER This white paper discusses the concept
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationDominik Wagenknecht Accenture
Dominik Wagenknecht Accenture Improving Mainframe Performance with Hadoop October 17, 2014 Organizers General Partner Top Media Partner Media Partner Supporters About me Dominik Wagenknecht Accenture Vienna
More informationwww.basho.com Technical Overview Simple, Scalable, Object Storage Software
www.basho.com Technical Overview Simple, Scalable, Object Storage Software Table of Contents Table of Contents... 1 Introduction & Overview... 1 Architecture... 2 How it Works... 2 APIs and Interfaces...
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationMakeMyTrip CUSTOMER SUCCESS STORY
MakeMyTrip CUSTOMER SUCCESS STORY MakeMyTrip is the leading travel site in India that is running two ClustrixDB clusters as multi-master in two regions. It removed single point of failure. MakeMyTrip frequently
More informationQsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
More informationA Survey of Distributed Database Management Systems
Brady Kyle CSC-557 4-27-14 A Survey of Distributed Database Management Systems Big data has been described as having some or all of the following characteristics: high velocity, heterogeneous structure,
More informationNoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationIntroduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
More informationScaleArc for SQL Server
Solution Brief ScaleArc for SQL Server Overview Organizations around the world depend on SQL Server for their revenuegenerating, customer-facing applications, running their most business-critical operations
More informationIntel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013
Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software SC13, November, 2013 Agenda Abstract Opportunity: HPC Adoption of Big Data Analytics on Apache
More informationGet More Scalability and Flexibility for Big Data
Solution Overview LexisNexis High-Performance Computing Cluster Systems Platform Get More Scalability and Flexibility for What You Will Learn Modern enterprises are challenged with the need to store and
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationUbuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
More informationComprehensive Analytics on the Hortonworks Data Platform
Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page
More informationThe Production Cloud
The Production Cloud The cloud is not just for backup storage, development projects and other low-risk applications. In this document, we look at the characteristics of a public cloud environment that
More informationI/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationWhite Paper: What You Need To Know About Hadoop
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
More informationInfomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
More informationWOS Cloud. ddn.com. Personal Storage for the Enterprise. DDN Solution Brief
DDN Solution Brief Personal Storage for the Enterprise WOS Cloud Secure, Shared Drop-in File Access for Enterprise Users, Anytime and Anywhere 2011 DataDirect Networks. All Rights Reserved DDN WOS Cloud
More informationChukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
More informationAdvanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
More informationHealthCare Anytime. As we approach the 2020s, the trend toward big data, tools, and systemization
Datastax Provides with a Strategic Competitive Advantage as They Improve Patients Medical Care Executive Summary For more than 20 years, much of the national debate on reforming health care has focused
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationHow To Use Hp Vertica Ondemand
Data sheet HP Vertica OnDemand Enterprise-class Big Data analytics in the cloud Enterprise-class Big Data analytics for any size organization Vertica OnDemand Organizations today are experiencing a greater
More informationSQLstream Blaze and Apache Storm A BENCHMARK COMPARISON
SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence
More informationBASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS
WHITEPAPER BASHO DATA PLATFORM BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS INTRODUCTION Big Data applications and the Internet of Things (IoT) are changing and often improving our
More informationHow To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
More informationHigh Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper
High Availability with Postgres Plus Advanced Server An EnterpriseDB White Paper For DBAs, Database Architects & IT Directors December 2013 Table of Contents Introduction 3 Active/Passive Clustering 4
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationBig data blue print for cloud architecture
Big data blue print for cloud architecture -COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges
More informationNo-SQL Databases for High Volume Data
Target Conference 2014 No-SQL Databases for High Volume Data Edward Wijnen 3 November 2014 The New Connected World Needs a Revolutionary New DBMS Today The Internet of Things 1990 s Mobile 1970 s Mainfram
More informationBig Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationManaging Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
More informationT a c k l i ng Big Data w i th High-Performance
Worldwide Headquarters: 211 North Union Street, Suite 105, Alexandria, VA 22314, USA P.571.296.8060 F.508.988.7881 www.idc-gi.com T a c k l i ng Big Data w i th High-Performance Computing W H I T E P A
More informationDBA'S GUIDE TO NOSQL APACHE CASSANDRA
DBA'S GUIDE TO NOSQL APACHE CASSANDRA THE ENLIGHTENED DBA Smashwords Edition Copyright 2014 The Enlightened DBA This ebook is licensed for your personal enjoyment only. This ebook may not be re-sold or
More informationHow To Make Data Streaming A Real Time Intelligence
REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationSOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
More informationWhy NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
More informationDelivering Real-World Total Cost of Ownership and Operational Benefits
Delivering Real-World Total Cost of Ownership and Operational Benefits Treasure Data - Delivering Real-World Total Cost of Ownership and Operational Benefits 1 Background Big Data is traditionally thought
More informationENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE
ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how EMC Elastic Cloud Storage (ECS ) can be used to streamline the Hadoop data analytics
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More information