ScaleDB Managing Streams of Time Series Data



Using ScaleDB to Manage Streams of Time Series Data

Abstract

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams that need to be analyzed. Streaming data can be considered one of the main sources of Big Data. However, mining Big Data streams faces two principal challenges: Velocity, the frequency of data generation or delivery, and Volume, the amount of data accumulated. As many developers found traditional databases lacking support for Big Data, specialized solutions were introduced. Currently, most Big Data projects use Hadoop. However, Hadoop is a batch process that mainly addresses the Volume challenge. To address the Velocity challenge, architectural solutions are tailored around Hadoop; these are referred to as Lambda architectures, in which systems such as Storm, Druid, and Cassandra process incoming data before it is processed in Hadoop. For time series data we propose a different approach. ScaleDB keeps the data in the database, making the solution SQL based, simple to develop and deploy, and cost effective to operate and maintain. Scaling is done, as in Hadoop, by adding nodes to a cluster; unlike Hadoop, however, insertion is done in near real time, supporting Velocity that can exceed millions of inserts per second. In particular, performance remains constant as data volume grows. ScaleDB is disk based, offering TCO advantages over in-memory solutions. ScaleDB leverages the entire MySQL ecosystem of tools, people, and processes, avoiding the extremely complex Lambda deployment. Lastly, ScaleDB is row based, providing a unified and integrated solution to manage streaming, operational, and historical data.

ScaleDB is a platform of database and storage nodes connected by a network. These connected machines form a cluster of database and storage nodes that processes data in 2 tiers: a database tier and a storage tier.
The storage tier provides a shared data container for the database nodes in the database tier. The overall cluster is a general purpose, shared disk, row based database for Big Data. It includes special optimizations to provide high insertion rates while supporting BI queries that evaluate data by time. The BI queries are pushed down to the storage nodes so that they are satisfied with a high degree of parallelism. Our experience and performance studies on standard commodity hardware show over a million inserts per second while concurrent BI queries evaluate hundreds of millions of rows per second, with near linear scalability achieved by adding database nodes and storage nodes to the cluster.

1. Introduction

Queries that evaluate data by time specify a time interval in which a property or a behavior needs to be determined and monitored. These types of queries are frequently used to support a variety of business requirements. Examples include determining sales trends, fraud detection, monitoring users' behavior based on click streams, and machine learning. A new and growing market is the Internet of Things, where devices generate enormous amounts of data that need to be evaluated over time. Applications that analyze data to satisfy these types of queries are challenged with huge amounts of data, high velocity data, and requirements for real time analysis. In addition, many applications not only require summary information, they are also interested in the source data. For example, in a security application, the same data set that raises an alert on a potential security breach by analyzing the number of entries to a particular IP and port may be required to present the details of the security event. To trace the root cause, a troubleshooter may need to know all the source and destination IPs and ports interacted with over a given time. ScaleDB's processing is based on source data and can therefore satisfy queries on individual rows.
ScaleDB uses MySQL/MariaDB as its interface, making it simple to deploy and manage and fully compatible with the entire MySQL ecosystem. ScaleDB provides the best price/performance solution for processing and analyzing time series data.

Table of Contents:
Section 2 - The ScaleDB System Overview
Section 3 - Data Organization
Section 4 - Local Time Based Indexes
Section 5 - Global Hash Indexes

ScaleDB Managing Streams of Time Series Data. Copyright 2015 by ScaleDB

Section 6 - Query Push Down Mechanism
Section 7 - Lock Manager
Section 8 - Experimental Results
Section 9 - Conclusions

2. The ScaleDB System Overview

ScaleDB provides a unified approach to manage data streams, historical data, and operational data. Tables are defined and data is processed using SQL. Scaling is achieved by adding nodes to a cluster without the need to shard or partition the data. As will be explained and demonstrated, insert and query performance scale linearly by adding nodes to the cluster, and data volume has no impact on performance. ScaleDB is a disk based solution: the amount of memory (RAM) required is orders of magnitude smaller than the total size of the data managed. This approach provides a much better TCO than alternative memory based DBMS solutions. This document will show how ScaleDB manages time series data very efficiently by overcoming the insertion and query row retrieval performance barriers of standard disk-based solutions.

ScaleDB is a relational, transactional, shared disk, clustered database using 3 major software components:

a) Storage Nodes - These nodes maintain the data. The data is organized in rows which are contained in disk based blocks. Frequently used blocks are maintained in a cache layer and less frequently used blocks are read from disk. With multiple storage nodes, the data is striped over the different nodes such that every node manages a portion of the data.

b) Database Nodes - These nodes receive and process low level logical requests to manipulate the data, for example, a request to insert a new row or a request to retrieve a row by a particular key. This functionality is triggered by a call to a ScaleDB native API. Each database node is able to read and write the entire data set (read from and write to each storage node).
When a database node processes data, it reads the data into its local cache from the storage node that maintains the data, and when a transaction commits, the database node transfers the updated data to the storage tier. ScaleDB does not provide a SQL parser but offers an interface to MariaDB: MariaDB treats ScaleDB as a native storage engine and is not aware of the cluster. The cluster level processes are managed by ScaleDB.

c) Lock Manager - As all the data is available to all the database nodes in the cluster, the lock manager provides a locking mechanism that synchronizes lock requests (on the shared resources) from the different database nodes in the cluster. The logic and performance considerations for the lock manager are explained in section 7 below.

These nodes are connected by a network and provide a 2 tier platform:

1. The database tier is a collection of database nodes with MariaDB as the user interface and ScaleDB as the database engine.
2. The storage tier is a collection of storage nodes. Each node is configured and tuned to satisfy IO requests from the database nodes.

The storage tier presents a complete and consistent view of the entire data set to each of the database nodes in the cluster, and each database node in turn presents a complete and consistent view of the data to the application.

[Figure: A ScaleDB Cluster - three DBMS nodes and a Cluster Manager connected to four Storage nodes]

HA of the database tier: since data is shared among all the database nodes, if a database node fails, queries can be routed to any of the surviving nodes. HA of the distributed lock manager is done by maintaining a standby lock manager that manages the cluster if the active lock manager fails. To explain how HA is maintained in the storage tier we define the concept of a volume. A volume is a logical container that maintains a portion of the data. For example, with 10 volumes the data is striped over the 10 volumes such that every volume has one tenth of the data. Practically, a volume can be supported by one storage node. However, to provide HA, a volume is supported by 2 or more nodes, and when data is sent to be written to a volume, it is written to all the nodes that support the volume. Therefore, if a storage node fails, the data in a particular volume remains available from a different node in the volume and the system continues to run.

For regular tables, ScaleDB uses general purpose indexes which are based on unique, balanced Patricia Trie structures. These were detailed in an earlier publication. These indexes are global indexes across the cluster, meaning that they are treated as shared resources by the different database nodes in the cluster.

In this paper we detail a new approach to manage Big Data in a clustered database with special optimizations for time series data. These optimizations allow a high insertion rate, efficient retrieval of rows by key values, and efficient queries that filter and group data. Our method defines a special table in the database, called a Streaming Table, with the following characteristics:

1. The primary key is a unique number which is automatically generated.
2. One of the columns in the table is a time stamp which is automatically generated and represents the insert time of each row in the database.
3. The data of the table is striped over multiple storage nodes in the storage tier.
4. Each storage node maintains a local index over the data by the time stamps.
5. In every storage node, the data is sequentially organized by time.
6. The global indexes are based on hash structures.
7. Queries that evaluate data within a time interval leverage a pushdown mechanism.

3. Data Organization

ScaleDB operates multiple database instances on different servers in the cluster against a shared set of data files. Each data file spans multiple hardware systems and yet appears as a single unified dataset to the database node. This enables the utilization of commodity hardware to reduce total cost of ownership and to provide a scalable computing environment that supports massive data volumes.
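The pushdown behavior noted above (queries over a time interval are pushed to the storage nodes) amounts to a scatter/gather pattern: the predicate is sent to every volume, each volume filters its own slice, and the database node merges the small per-volume results. A minimal sketch in Python, with hypothetical names and in-memory lists standing in for storage nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def volume_scan(rows, t_start, t_end):
    """Runs on a storage node: return only the rows inside the interval."""
    return [r for r in rows if t_start <= r["ts"] < t_end]

def pushdown_query(volumes, t_start, t_end):
    """Runs on a database node: scatter the predicate, gather the results."""
    with ThreadPoolExecutor(max_workers=len(volumes)) as pool:
        parts = pool.map(lambda v: volume_scan(v, t_start, t_end), volumes)
    return [row for part in parts for row in part]

# Toy data: 4 volumes, rows striped across them in insert order.
volumes = [[] for _ in range(4)]
for ts in range(100):
    volumes[ts % 4].append({"ts": ts})

hits = pushdown_query(volumes, 10, 20)
print(len(hits))  # 10 rows fall inside [10, 20)
```

Only the matching rows cross the network; the bulk of the scan work happens in parallel next to the data.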

Each database instance considers the data as organized in tables, where each table organizes the data in uniform rows. Logically, tables appear to the database instances as a continuous set of rows. However, the physical implementation is different, to support performance, scalability and HA.

ScaleDB places the data in volumes. A volume is a logical unit that maintains data. With N volumes, each volume manages 1/N of the data. A volume is supported by one or more storage nodes, which are commodity machines (or virtual machines). When data is written, it is written to all the nodes assigned to the volume. If a volume is supported by 2 nodes, the data is written twice, and therefore the system continues to run if a storage node fails. When a row is written, it is placed in a disk based block, and each block is shipped to one volume in the storage tier. As volumes are randomly picked, ScaleDB achieves an even distribution of the data among the volumes and avoids hotspots. In each storage node, the blocks are organized sequentially in one or more files, and these files are dynamically mapped to the logical tables that are considered and processed by the database nodes. Similarly, a table may be supported by indexes that are organized in disk based blocks and are randomly placed in the different volumes. The outcome of this approach is that the data is striped across multiple volumes and the IO operations are efficient, as they leverage all the resources (CPU, memory, disks) of all the nodes in the storage tier. Scaling of the storage tier is simple, as it only requires additional nodes.

However, for high volume and velocity with disk based data, a traditional indexing approach may still be insufficient. If we expect hundreds of thousands or more inserts per second, a disk based solution that requires random IOs will not be able to support the insert rate efficiently.
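The placement scheme described above (a randomly picked volume, with the block written to every node that backs the volume) can be sketched as follows. The names are hypothetical, and Python lists stand in for disk based blocks:

```python
import random

class Volume:
    """A logical volume backed by one or more storage nodes."""
    def __init__(self, node_names):
        self.nodes = {name: [] for name in node_names}  # node -> blocks

    def write(self, block):
        # Replicate the block to every node that supports this volume,
        # so the data survives the failure of a single storage node.
        for blocks in self.nodes.values():
            blocks.append(block)

# 10 volumes, each backed by 2 storage nodes.
volumes = [Volume([f"vol{i}-a", f"vol{i}-b"]) for i in range(10)]

random.seed(7)
for block_id in range(10_000):
    random.choice(volumes).write(block_id)  # random pick -> even spread

counts = [len(next(iter(v.nodes.values()))) for v in volumes]
print(min(counts), max(counts))  # each volume holds roughly 1/10 of the blocks
```

With a uniform random pick, each volume ends up with close to 1/N of the blocks, which is the even distribution and hotspot avoidance the text describes.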
For queries, if rows are retrieved by an index from disk, even a single random IO per row would make the query slow when millions of rows are retrieved (1,000,000 reads at 4ms per IO is more than an hour). ScaleDB uses 2 techniques to support huge insertion rates and queries that evaluate large data sets: a special organization of the data, and a high degree of parallelism.

To avoid the random seeks, time series data is organized by time such that queries by time read data sequentially. This organization allows for efficient retrieval of rows in a given time range. If the first block of a time range is available (see section 4 below), then all the rows of the time range can be efficiently retrieved by sequentially reading consecutive blocks. Therefore, when rows are retrieved by time, a sequential scan replaces the random seeks of a conventional index. As the data of a table is evenly distributed over all the volumes, the scan (and sometimes the query, see section 6) is pushed to all the volumes such that multiple machines retrieve the data concurrently. This approach provides a high degree of parallelism. The degree of parallelism can be increased by adding volumes to the cluster: the additional nodes provide additional resources and increase parallelism such that query performance for time series data can be set to meet any performance requirement.

4. Local Storage Indexes

When a block is written to a storage node, the block is indexed such that it is possible to locate the block by the time stamps represented in the time column of the rows contained in the block. A search for data by time is processed by each storage node independently and aggregated by the database node that initiated the query (see section 6). The process in each storage node is done in 2 steps: locating the first block with the needed time stamp, then retrieving the rest of the data sequentially.
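The two-step process can be illustrated with a small sketch: a sorted list of first timestamps acts as the local index, a binary search locates the first candidate block, and the scan then reads consecutive blocks until it passes the end of the interval. The class and names are hypothetical:

```python
import bisect

class TimeIndexedStore:
    """One storage node: blocks arrive in time order and stay in that order."""
    def __init__(self):
        self.blocks = []    # blocks in insert (= time) order
        self.first_ts = []  # local index: first timestamp of each block

    def append_block(self, rows):
        self.first_ts.append(rows[0])
        self.blocks.append(rows)

    def range_scan(self, t_start, t_end):
        # Step 1: one index lookup finds the first block that may hold t_start.
        i = max(bisect.bisect_right(self.first_ts, t_start) - 1, 0)
        # Step 2: read consecutive blocks sequentially until past the interval.
        out = []
        while i < len(self.blocks) and self.first_ts[i] < t_end:
            out.extend(t for t in self.blocks[i] if t_start <= t < t_end)
            i += 1
        return out

store = TimeIndexedStore()
for start in range(0, 100, 10):  # 10 blocks of 10 timestamps each
    store.append_block(list(range(start, start + 10)))

print(store.range_scan(25, 43))  # timestamps 25 through 42
```

Note that only the initial lookup touches the index; everything after it is a sequential read, so retrieval cost depends on the size of the interval, not on the total data volume.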
The search ends when the data examined is outside the needed time range. This approach scales linearly by adding storage nodes to the cluster. In addition, with this approach, the data volume does not impact the retrieval time.

5. Global Cluster (Hash Based) Indexes

The local indexes (section 4) support queries that evaluate data by time. However, many applications require other views of the data. For example, a query may need to view the rows of a particular device, or all the rows relating to a particular customer. These queries are supported by the Global Indexes. For a streaming table the global indexes are hash based. These indexes replace our general purpose indexes (which are used to index a non-streaming table) as they have a very small impact on insertion time. With a standard tree based index, every insert and search requires analyzing and synchronizing multiple index blocks (root to leaf) and, in the case of an insert, updating the leaf block of the index. The hash indexes are simpler and faster and are processed with less contention (between nodes of the cluster). With the hash based indexes, inserts are done at a high rate while providing different views over the data.

6. The Query Pushdown Mechanism

The ScaleDB Pushdown Mechanism enables certain types of query processing to be done in the storage nodes. With Pushdown technology, the database nodes send query details to the storage nodes. With this information, the storage nodes can take over a large portion of the data-intensive query processing. The Pushdown Mechanism can search storage nodes with added intelligence about the query and send only the relevant bytes, not all the database blocks, to the database nodes. The performance benefits of this approach are as follows:

1. Parallel processing: every query is supported by multiple storage nodes. With 10 volumes, 10 machines execute the query. Each machine contributes resources (CPU, memory) to support the query and all machines work in parallel.
2. Rather than shipping large amounts of data to the database node, the query is executed next to the data and only a small amount of data is passed over the network to the database node.
3. Query performance can be increased (linearly) by adding storage nodes to the cluster.

7. The Lock Manager

In a non-clustered database, lock conflicts between different processes are resolved by a lock manager. The different threads that operate over the data use a structure in a shared memory space that represents the locking status and pending requests, and a process that resolves conflicts. We call a lock manager that operates within a shared memory space a local lock manager. However, a local lock manager is not sufficient to resolve conflicts between processes on different database nodes. As these processes do not have shared memory, synchronization is done by messages that are sent to and from a distributed lock manager.
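A minimal sketch of message-based lock coordination, with hypothetical names: a synchronous lock request must wait for the lock manager's reply (a full network round trip), while an asynchronous notification returns immediately. Queues stand in for the network here:

```python
import queue
import threading
import time

class LockManager:
    """The distributed lock manager: every operation arrives as a message."""
    def __init__(self):
        self.inbox = queue.Queue()
        self.held = {}  # resource -> node currently holding the lock
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            kind, resource, node, reply = self.inbox.get()
            self.held[resource] = node  # record the holder
            if reply is not None:       # synchronous request: send a reply
                reply.put("granted")

def sync_lock(mgr, resource, node):
    """Synchronous request: blocks until the lock manager replies."""
    reply = queue.Queue()
    mgr.inbox.put(("request", resource, node, reply))
    return reply.get()

def lock_taken(mgr, resource, node):
    """Asynchronous notification: fire and forget, no reply expected."""
    mgr.inbox.put(("taken", resource, node, None))

mgr = LockManager()
print(sync_lock(mgr, "block-17", "db1"))  # waits for the round trip
lock_taken(mgr, "block-42", "db2")        # returns immediately
time.sleep(0.1)                           # let the manager drain its inbox
print(mgr.held)
```

The round trip in `sync_lock` is the cost the text refers to; the one-way message in `lock_taken` avoids it when the sender already knows there is no conflict.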
As messaging is orders of magnitude slower than updating a memory structure, a special locking process was developed to overcome the performance challenges. We call this process Lock Taken. In a Lock Taken process, a node (in the clustered environment) is able to identify that there are no conflicting requests on a shared resource and issue an asynchronous lock message to the lock manager. This message informs the lock manager that a lock over a particular resource was taken, and triggers a process in the distributed lock manager that updates its internal structures to reflect that a particular node is holding a lock on the particular resource. The Lock Taken messages differ from Lock Request messages in that they are performed asynchronously. This approach eliminates the performance penalties of lock requests, which require a reply from the distributed lock manager.

8. Experimental Results

A ScaleDB cluster was configured with the following 9 machines:

a) Storage nodes (4 total): 4 x Intel(R) Xeon(R) CPU 2.27GHz (4 core) with 23GB memory
b) Cluster Manager (1 total): 1 x Intel(R) Core(TM) i GHz (4 core) with 32GB memory
c) DB nodes (4 total): 3 x Intel(R) Core(TM) i7 CPU 3.07GHz (4 core) with 5GB memory, 1 x Intel(R) Xeon(R) CPU 2.13GHz (4 core) with 5GB memory

Each storage node has 4 disks, 15k RPM 600GB each, SAS 6Gb/sec throughput, RAID 0, HP ProLiant DL380G

The tests used 2 tables. A streaming table that processed massive amounts of data:

CREATE TABLE `payment_scaledb_streaming` (
  `sale_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT `STREAMING_KEY`=Yes,
  `create_time` timestamp NOT NULL DEFAULT ' :00:00',
  `account` bigint(20) unsigned NOT NULL DEFAULT '0' `HASHKEY`=YES `HashSize`=10000,
  `store` char(16) NOT NULL DEFAULT '',
  `product` char(16) NOT NULL DEFAULT '',
  `coupon` char(7) NOT NULL DEFAULT '',
  `amount` decimal(8,2) NOT NULL,
  PRIMARY KEY (`sale_id`),
  KEY `account` (`account`),
  KEY `create_time` (`create_time`)
) ENGINE=ScaleDB DEFAULT CHARSET=latin1 `Streaming`=YES `RangeKey`=create_time

A regular table that maintains information that is joined with the data of the streaming table to satisfy queries:

CREATE TABLE `stores_scaledb` (
  `name` char(25) NOT NULL DEFAULT '',
  `street` char(25) DEFAULT NULL,
  `city` char(25) DEFAULT NULL,
  `state` char(2) DEFAULT NULL,
  `zipcode` char(5) DEFAULT NULL,
  `phone` char(12) DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=ScaleDB DEFAULT CHARSET=latin1

a) Evaluating insert performance of a streaming table

For insert performance we considered 3 questions:

1. What insert performance does the ScaleDB cluster provide?
2. What is the impact of data volume on performance (does performance remain constant as the data size increases)?
3. What is the performance factor compared to a standard insert on a conventional DBMS?

[Chart: Rows per second over time (minutes) - ScaleDB streaming table inserts vs. InnoDB]

The insert test shows the following:

a. On this particular cluster, the insert rate was close to 1.2 million rows per second.
b. The insert rate remained constant as data volume grew (for the entire 4 hours).
c. Compared to a single, standard DBMS (we used MariaDB with InnoDB), performance started at about 100,000 rows per second but quickly degraded to about 5,000 inserts per second.

b) Evaluating query performance of a streaming table

We tested 4 different queries:

Query 1 - Count of rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows.
Query 2 - Retrieve rows within a time interval with a filter on some column values. The time interval had about 10 million rows from a table with 4 billion rows, and 3,000 rows satisfied the filter conditions.
Query 3 - Group and analyze rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows. The group by yields 10 rows.
Query 4 - Same as Query 3, with a join on each group by row. This query is only executed to show that joins are supported.

Note 1: These queries were executed while the database continued to ingest data at the same rate (1.2 million inserts/second).
Note 2: We tested a standard DBMS solution (MariaDB with InnoDB) executing the same queries.
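For concreteness, the shape of Query 1 and Query 2 can be shown with a tiny stand-in example. This sketch uses SQLite rather than ScaleDB/MariaDB, with toy data; the column names follow the streaming table defined above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE payment (
    sale_id INTEGER PRIMARY KEY,
    create_time INTEGER,   -- epoch seconds for simplicity
    account INTEGER,
    amount REAL)""")
con.executemany(
    "INSERT INTO payment (create_time, account, amount) VALUES (?, ?, ?)",
    [(t, t % 5, 1.0 * t) for t in range(1000)])

# Query 1: count of rows within a time interval.
(count,) = con.execute(
    "SELECT COUNT(*) FROM payment WHERE create_time >= ? AND create_time < ?",
    (100, 200)).fetchone()
print(count)  # 100

# Query 2: rows within the interval that also satisfy a column filter.
rows = con.execute(
    """SELECT sale_id, amount FROM payment
       WHERE create_time >= ? AND create_time < ? AND account = ?""",
    (100, 200, 3)).fetchall()
print(len(rows))  # 20
```

In ScaleDB the time predicate would be satisfied by the local time index and sequential scan on each storage node, and the column filter would be applied by the pushdown mechanism before results cross the network.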

[Chart: Query time (seconds) - Simple count of 70M rows, InnoDB vs. ScaleDB]

[Chart: Query time (seconds) - Filter 10M rows to 3K rows, InnoDB vs. ScaleDB]

[Chart: Query time (seconds) - Group and analyze 70M rows, InnoDB vs. ScaleDB]

[Chart: Query time (seconds) - Joining tables, InnoDB vs. ScaleDB]

9. Conclusions

ScaleDB is a 2 tier database cluster: a database tier with multiple database nodes and a storage tier with multiple storage nodes. This environment is a general purpose database platform with a distributed lock manager. The processes in the environment are tuned to support high rates of inserts (millions of rows per second) and efficient queries of time series data (hundreds of millions of rows can be evaluated per second). In the storage tier, the data is striped into volumes, and on each volume it is organized by time. Each storage node in the storage tier maintains a local index by time. This index locates a starting point for a search, from which the time series data is retrieved sequentially. Queries evaluating time series data are pushed down to be executed on multiple storage nodes, allowing each query to leverage more compute resources than a single machine offers. This approach provides a huge amount of parallelism.

ScaleDB creates a complete solution to manage data. Tables supporting historical and operational data can be extended to manage time series data in a highly efficient way. Data from all tables can be joined to provide a complete and consistent view of the entire data set. Developers looking to scale their databases do not need to redesign or shard their data: the system scales by adding database and storage nodes to a cluster without the need to partition or shard the data.


More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

SCALABLE DATA SERVICES

SCALABLE DATA SERVICES 1 SCALABLE DATA SERVICES 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2 Overview MySQL Database Clustering GlusterFS Memcached 3 Overview Problems of Data Services 4 Data retrieval

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Agenda Introduction Database Architecture Direct NFS Client NFS Server

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Database Scalability and Oracle 12c

Database Scalability and Oracle 12c Database Scalability and Oracle 12c Marcelle Kratochvil CTO Piction ACE Director All Data/Any Data marcelle@piction.com Warning I will be covering topics and saying things that will cause a rethink in

More information

In Memory Accelerator for MongoDB

In Memory Accelerator for MongoDB In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000

More information

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a

More information

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

More information

Welcome to Virtual Developer Day MySQL!

Welcome to Virtual Developer Day MySQL! Welcome to Virtual Developer Day MySQL! Keynote: Developer and DBA Guide to What s New in MySQL Andrew Morgan - MySQL Product Management @andrewmorgan www.clusterdb.com 1 Program Agenda 1:00 PM Keynote:

More information

Configuring Apache Derby for Performance and Durability Olav Sandstå

Configuring Apache Derby for Performance and Durability Olav Sandstå Configuring Apache Derby for Performance and Durability Olav Sandstå Database Technology Group Sun Microsystems Trondheim, Norway Overview Background > Transactions, Failure Classes, Derby Architecture

More information

GigaSpaces Real-Time Analytics for Big Data

GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Using an In-Memory Data Grid for Near Real-Time Data Analysis SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices

Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices Sawmill Log Analyzer Best Practices!! Page 1 of 6 Sawmill Log Analyzer Best Practices! Sawmill Log Analyzer Best Practices!! Page 2 of 6 This document describes best practices for the Sawmill universal

More information

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale WHITE PAPER Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale Sponsored by: IBM Carl W. Olofson December 2014 IN THIS WHITE PAPER This white paper discusses the concept

More information

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010 Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010 This document is provided as-is. Information and views expressed in this document, including URL and other Internet

More information

Configuring Apache Derby for Performance and Durability Olav Sandstå

Configuring Apache Derby for Performance and Durability Olav Sandstå Configuring Apache Derby for Performance and Durability Olav Sandstå Sun Microsystems Trondheim, Norway Agenda Apache Derby introduction Performance and durability Performance tips Open source database

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Real Life Performance of In-Memory Database Systems for BI

Real Life Performance of In-Memory Database Systems for BI D1 Solutions AG a Netcetera Company Real Life Performance of In-Memory Database Systems for BI 10th European TDWI Conference Munich, June 2010 10th European TDWI Conference Munich, June 2010 Authors: Dr.

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

In-Memory Databases MemSQL

In-Memory Databases MemSQL IT4BI - Université Libre de Bruxelles In-Memory Databases MemSQL Gabby Nikolova Thao Ha Contents I. In-memory Databases...4 1. Concept:...4 2. Indexing:...4 a. b. c. d. AVL Tree:...4 B-Tree and B+ Tree:...5

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier

Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier Tips and Tricks for Using Oracle TimesTen In-Memory Database in the Application Tier Simon Law TimesTen Product Manager, Oracle Meet The Experts: Andy Yao TimesTen Product Manager, Oracle Gagan Singh Senior

More information

Esri ArcGIS Server 10 for VMware Infrastructure

Esri ArcGIS Server 10 for VMware Infrastructure Esri ArcGIS Server 10 for VMware Infrastructure October 2011 DEPLOYMENT AND TECHNICAL CONSIDERATIONS GUIDE Table of Contents Introduction... 3 Esri ArcGIS Server 10 Overview.... 3 VMware Infrastructure

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

MS SQL Server 2014 New Features and Database Administration

MS SQL Server 2014 New Features and Database Administration MS SQL Server 2014 New Features and Database Administration MS SQL Server 2014 Architecture Database Files and Transaction Log SQL Native Client System Databases Schemas Synonyms Dynamic Management Objects

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

A survey of big data architectures for handling massive data

A survey of big data architectures for handling massive data CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context

More information

WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE

WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE WITH A FUSION POWERED SQL SERVER 2014 IN-MEMORY OLTP DATABASE 1 W W W. F U S I ON I O.COM Table of Contents Table of Contents... 2 Executive Summary... 3 Introduction: In-Memory Meets iomemory... 4 What

More information

SAP HANA. SAP HANA Performance Efficient Speed and Scale-Out for Real-Time Business Intelligence

SAP HANA. SAP HANA Performance Efficient Speed and Scale-Out for Real-Time Business Intelligence SAP HANA SAP HANA Performance Efficient Speed and Scale-Out for Real-Time Business Intelligence SAP HANA Performance Table of Contents 3 Introduction 4 The Test Environment Database Schema Test Data System

More information

Memory-Centric Database Acceleration

Memory-Centric Database Acceleration Memory-Centric Database Acceleration Achieving an Order of Magnitude Increase in Database Performance A FedCentric Technologies White Paper September 2007 Executive Summary Businesses are facing daunting

More information

Innovative technology for big data analytics

Innovative technology for big data analytics Technical white paper Innovative technology for big data analytics The HP Vertica Analytics Platform database provides price/performance, scalability, availability, and ease of administration Table of

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...

More information

Raima Database Manager Version 14.0 In-memory Database Engine

Raima Database Manager Version 14.0 In-memory Database Engine + Raima Database Manager Version 14.0 In-memory Database Engine By Jeffrey R. Parsons, Senior Engineer January 2016 Abstract Raima Database Manager (RDM) v14.0 contains an all new data storage engine optimized

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Microsoft Analytics Platform System. Solution Brief

Microsoft Analytics Platform System. Solution Brief Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Quantcast Petabyte Storage at Half Price with QFS!

Quantcast Petabyte Storage at Half Price with QFS! 9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed

More information

Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp

Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3)

More information

PERFORMANCE TIPS FOR BATCH JOBS

PERFORMANCE TIPS FOR BATCH JOBS PERFORMANCE TIPS FOR BATCH JOBS Here is a list of effective ways to improve performance of batch jobs. This is probably the most common performance lapse I see. The point is to avoid looping through millions

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

SQL Anywhere 12 New Features Summary

SQL Anywhere 12 New Features Summary SQL Anywhere 12 WHITE PAPER www.sybase.com/sqlanywhere Contents: Introduction... 2 Out of Box Performance... 3 Automatic Tuning of Server Threads... 3 Column Statistics Management... 3 Improved Remote

More information

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here PLATFORM Top Ten Questions for Choosing In-Memory Databases Start Here PLATFORM Top Ten Questions for Choosing In-Memory Databases. Are my applications accelerated without manual intervention and tuning?.

More information

White Paper November 2015. Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses

White Paper November 2015. Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses White Paper November 2015 Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses Our Evolutionary Approach to Integration With the proliferation of SaaS adoption, a gap

More information

Performance Baseline of Hitachi Data Systems HUS VM All Flash Array for Oracle

Performance Baseline of Hitachi Data Systems HUS VM All Flash Array for Oracle Performance Baseline of Hitachi Data Systems HUS VM All Flash Array for Oracle Storage and Database Performance Benchware Performance Suite Release 8.5 (Build 131015) November 2013 Contents 1 System Configuration

More information

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing An Alternative Storage Solution for MapReduce Eric Lomascolo Director, Solutions Marketing MapReduce Breaks the Problem Down Data Analysis Distributes processing work (Map) across compute nodes and accumulates

More information

DELL RAID PRIMER DELL PERC RAID CONTROLLERS. Joe H. Trickey III. Dell Storage RAID Product Marketing. John Seward. Dell Storage RAID Engineering

DELL RAID PRIMER DELL PERC RAID CONTROLLERS. Joe H. Trickey III. Dell Storage RAID Product Marketing. John Seward. Dell Storage RAID Engineering DELL RAID PRIMER DELL PERC RAID CONTROLLERS Joe H. Trickey III Dell Storage RAID Product Marketing John Seward Dell Storage RAID Engineering http://www.dell.com/content/topics/topic.aspx/global/products/pvaul/top

More information

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part: Cloud (data) management Ahmed Ali-Eldin First part: ZooKeeper (Yahoo!) Agenda A highly available, scalable, distributed, configuration, consensus, group membership, leader election, naming, and coordination

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform: Creating an Integrated, Optimized, and Secure Enterprise Data Platform: IBM PureData System for Transactions with SafeNet s ProtectDB and DataSecure Table of contents 1. Data, Data, Everywhere... 3 2.

More information

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!)

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!) MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!) Erdélyi Ernő, Component Soft Kft. erno@component.hu www.component.hu 2013 (c) Component Soft Ltd Leading Hadoop Vendor Copyright 2013,

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Increasing Flash Throughput for Big Data Applications (Data Management Track)

Increasing Flash Throughput for Big Data Applications (Data Management Track) Scale Simplify Optimize Evolve Increasing Flash Throughput for Big Data Applications (Data Management Track) Flash Memory 1 Industry Context Addressing the challenge A proposed solution Review of the Benefits

More information