ScaleDB Managing Streams of Time Series Data
- Louisa Nash
- 7 years ago
Using ScaleDB to Manage Streams of Time Series Data

Abstract

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams that need to be analyzed. Streaming data can be considered one of the main sources of Big Data. However, mining Big Data streams faces two principal challenges: Velocity, the frequency of data generation or data delivery, and Volume, the amount of data accumulated.

As many developers found traditional databases lacking support for Big Data, specialized solutions were introduced. Currently, most Big Data projects use Hadoop. However, Hadoop is a batch-processing system that mainly addresses the Volume challenge. To address the Velocity challenge, different architectural solutions, referred to as Lambda architectures, are tailored around Hadoop. Examples are Storm, Druid and Cassandra, which are set up to process incoming data before the data is processed in Hadoop.

For time series data we propose a different approach. ScaleDB keeps the data in the database, making the solution SQL based, simple to develop and deploy, and cost effective to operate and maintain. Scaling is done as in Hadoop, by adding nodes to a cluster. Unlike Hadoop, however, insertion is done in near real time, supporting a Velocity that can exceed millions of inserts per second. In particular, performance remains constant as data volume grows. ScaleDB is disk based, offering TCO advantages over in-memory solutions. ScaleDB leverages the entire MySQL ecosystem of tools, people and processes, avoiding the extremely complex Lambda deployment. Lastly, ScaleDB is row based, providing a unified and integrated solution to manage streaming, operational and historical data.

ScaleDB is a platform of database and storage nodes connected by a network. These connected machines form a cluster that processes data in 2 tiers: a database tier and a storage tier.
The storage tier provides a shared data container for the database nodes in the database tier. The overall cluster is a general purpose, shared disk, row based database for Big Data. It includes special optimizations to provide high insertion rates while supporting BI queries that evaluate data by time. The BI queries are pushed to the storage nodes so that they are satisfied with a high degree of parallelism. Our experience and performance studies on standard commodity hardware show over a million inserts per second while concurrent BI queries evaluate hundreds of millions of rows per second, with near-linear scalability achieved by adding database nodes and storage nodes to the cluster.

1. Introduction

Queries that evaluate data by time specify a time interval in which a property or a behavior needs to be determined and monitored. These types of queries are frequently used to support a variety of business requirements. Examples include determining sales trends, fraud detection, monitoring users' behavior based on click streams, and machine learning. A new and growing market is the Internet of Things, where devices generate enormous amounts of data that need to be evaluated over time.

Applications that analyze data to satisfy these types of queries are challenged by huge data volumes, high velocity data and requirements for real time analysis. In addition, many applications require not only the summary information but also the source data. For example, in a security application, the same data set that alerts on a potential security breach by analyzing the number of entries to a particular IP and port may be required to present the details of the security event. To trace the root cause, a troubleshooter may need to know all the source and destination IPs and ports that were interacted with over a given time. ScaleDB processes are based on source data and therefore can satisfy queries on individual rows.
ScaleDB uses MySQL/MariaDB as the interface, making it simple to deploy and manage and fully compatible with the entire MySQL ecosystem. ScaleDB provides the best price performance solution for processing and analyzing time-series data.

ScaleDB Managing Streams of Time Series Data. Copyright 2015 by ScaleDB.

Table of Contents:
Section 2 The ScaleDB System Overview
Section 3 Data Organization
Section 4 Local Time Based Indexes
Section 5 Global Hash Indexes
Section 6 Query Push Down Mechanism
Section 7 Lock Manager
Section 8 Experimental Results
Section 9 Conclusions

2. The ScaleDB System Overview

ScaleDB provides a unified approach to manage data streams, historical data and operational data. Tables are defined and data is processed using SQL. Scaling is achieved by adding nodes to a cluster without the need to shard or partition the data. As will be explained and demonstrated, insert and query performance scale linearly as nodes are added to the cluster, and data volume has no impact on performance.

ScaleDB is a disk based solution: the amount of memory (RAM) required is orders of magnitude smaller than the total size of the data managed. This approach provides a much better TCO than alternative memory based DBMS solutions. This document shows how ScaleDB manages time series data efficiently by overcoming the insertion and row retrieval performance barriers of standard disk-based solutions.

ScaleDB is a relational, transactional, shared disk, clustered database using 3 major software components:

a) Storage Nodes - These nodes maintain the data. The data is organized in rows which are contained in disk based blocks. Frequently used blocks are maintained in a cache layer and less frequently used blocks are read from disk. With multiple storage nodes, the data is striped over the different nodes such that every node manages a portion of the data.

b) Database Nodes - These nodes receive and process low level logical requests to manipulate the data, for example a request to insert a new row or a request to retrieve a row by a particular key. This functionality is triggered by a call to a ScaleDB native API. Each database node is able to read and write the entire data set (read from and write to each storage node).
When a database node processes data, it reads the data into its local cache from the storage node that maintains the data, and when a transaction commits, the database node transfers the updated data to the storage tier. ScaleDB does not provide a SQL parser but offers an interface to MariaDB. MariaDB treats ScaleDB as a native storage engine and is not aware of the cluster. The cluster level processes are managed by ScaleDB.

c) Lock Manager - As all the data is available to all the database nodes in the cluster, the lock manager provides a locking mechanism that synchronizes lock requests (for shared resources) from the different database nodes in the cluster. The logic and performance considerations for the lock manager are explained in section 7 below.

These nodes are connected by a network and provide a 2 tier platform:

1. The database tier is a collection of database nodes with MariaDB as the user interface and ScaleDB as the database engine.
2. The storage tier is a collection of storage nodes. Each node is configured and tuned to satisfy IO requests from the database nodes.

The storage tier presents a complete and consistent view of the entire data set to each of the database nodes in the cluster, and each database node in turn presents a complete and consistent view of the data to the application.
[Figure: A ScaleDB Cluster, showing three DBMS nodes, a Cluster Manager and four Storage nodes]

High availability (HA) of the database tier: since data is shared among all the database nodes, if a database node fails, queries can be routed to any of the surviving nodes. HA of the distributed lock manager is provided by a standby lock manager that manages the cluster if the active lock manager fails.

To explain how HA is maintained in the storage tier we define the concept of a volume. A volume is a logical container that maintains a portion of the data. For example, with 10 volumes the data is striped over the 10 volumes such that every volume has one tenth of the data. In practice, a volume can be supported by one storage node. However, to provide HA, a volume is supported by 2 or more nodes, and when data is sent to be written in a volume, it is written to all the nodes that support the volume. Therefore, if a storage node fails, the data in that volume remains available from a different node in the volume and the system continues to run.

For regular tables, ScaleDB uses general purpose indexes which are based on unique, balanced, Patricia Trie structures. These were detailed in [...]. These indexes are global indexes across the cluster, which means that they are treated as shared resources by the different database nodes in the cluster.

In this paper we detail a new approach to manage Big Data in a clustered database with special optimizations for time series data. These optimizations allow a high insertion rate, efficient queries of rows by key values and efficient queries that filter and group data. Our method defines a special table in the database. The table is called a Streaming Table and it has the following characteristics:

1. The primary key is a unique number which is automatically generated.
2. One of the columns in the table is a time stamp which is automatically generated and represents the insert time of each row in the database.
3. The data of the table is striped over multiple storage nodes in the storage tier.
4. Each storage node maintains a local index over the data by the time stamps.
5. In every storage node, the data is sequentially organized by time.
6. The global indexes are based on hash structures.
7. Queries that evaluate data within a time interval leverage a pushdown mechanism.

3. Data Organization

ScaleDB operates multiple database instances on different servers in the cluster against a shared set of data files. Each data file spans multiple hardware systems and yet appears as a single unified dataset to the database node. This enables the use of commodity hardware to reduce total cost of ownership and to provide a scalable computing environment that supports massive data volumes.
Each database instance considers the data as organized in tables, where each table organizes the data in uniform rows. Logically, tables appear to the database instances as a continuous set of rows. However, the physical implementation is different, in order to support performance, scalability and HA.

ScaleDB places the data in volumes. A volume is a logical unit that maintains data. With N volumes, each volume manages 1/N of the data. A volume is supported by one or more storage nodes, which are commodity machines (or virtual machines). When data is written, it is written to all the nodes assigned to the volume. If a volume is supported by 2 nodes, the data is written twice and therefore the system continues to run if a storage node fails.

When a row is written, it is placed in a disk based block, and each block is shipped to one volume in the storage tier. As volumes are randomly picked, ScaleDB achieves an even distribution of the data among the volumes and avoids hotspots. In each storage node, the blocks are organized sequentially in one or more files, and these files are dynamically mapped to the logical tables that are considered and processed by the database nodes. Similarly, a table may be supported by indexes that are organized in disk based blocks and are randomly placed in the different volumes.

The outcome of this approach is that the data is striped across multiple volumes and the IO operations are efficient, as they leverage all the resources (CPU, memory, disks) of all the nodes in the storage tier. Scaling the storage tier is simple, as it only requires additional nodes.

However, for high volume and velocity with disk based data, a traditional indexing approach may still be insufficient. If we expect hundreds of thousands or more inserts per second, a disk based solution that requires random IOs will not be able to support the insert rate efficiently.
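To put rough numbers on this limit, the following Python sketch estimates how many disks would be needed if every insert cost one random IO. The 4 ms latency is the per-IO figure this paper assumes for a random read; the rate of 200,000 inserts per second, the one-IO-per-insert model and the function name `disks_needed` are illustrative assumptions, not measurements.

```python
import math


def disks_needed(inserts_per_sec, ios_per_insert=1, io_latency_ms=4):
    """Estimate the disks required when each insert costs random IOs.

    Illustrative model only: each disk serves one random IO at a time,
    so a disk delivers 1000 / io_latency_ms IOs per second.
    """
    iops_per_disk = 1000 / io_latency_ms            # 4 ms per IO -> 250 IOs/sec
    required_iops = inserts_per_sec * ios_per_insert
    return math.ceil(required_iops / iops_per_disk)


if __name__ == "__main__":
    # Hypothetical workload: 200,000 inserts/sec, one random IO per insert.
    print(disks_needed(200_000))
```

Under these assumptions the estimate is 800 disks for the insert path alone, which is why an organization that avoids random IO matters at this insert rate.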
For queries, if rows are retrieved by an index from disk, even a single random IO per row would make the query slow when millions of rows are retrieved (1,000,000 reads at 4 ms per IO take more than an hour).

ScaleDB uses 2 techniques to support huge insertion rates and queries that evaluate large data sets. The first is a special organization of the data and the second is leveraging a high degree of parallelism.

In order to avoid random seeks, time series data is organized by time such that queries by time read data sequentially. This organization allows for efficient retrieval of rows in a given time range. If the first block of a time range is available (see section 4 below), then all the rows of the time range can be efficiently retrieved by sequentially reading consecutive blocks. Therefore, when rows are retrieved by time, a sequential scan replaces the random seek of a conventional index.

As the data of a table is evenly distributed over all the volumes, the scan (and sometimes the query, see section 6) is pushed to all the volumes such that multiple machines concurrently retrieve the data. This approach provides a high degree of parallelism. The degree of parallelism can be increased by adding volumes to the cluster: the additional nodes provide additional resources and increase parallelism, so that query performance for time series data can be scaled to meet the performance requirement.

4. Local Storage Indexes

When a block is written to a storage node, the block is indexed such that it is possible to locate the block by the time stamps in the time column of the rows contained in the block. A search for data by time is processed by each storage node independently and aggregated by the database node that initiated the query (see section 6). The process in each storage node is done in 2 steps: locating the first block with the needed time stamp and retrieving the rest of the data sequentially.
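The two-step lookup and the aggregation by the database node can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ScaleDB's implementation: `StorageNode`, `build_cluster` and `count_rows` are hypothetical names, each volume is modeled as a single node, integers stand in for time stamps, and the local index is modeled as a sorted list holding the last time stamp of each block.

```python
import random
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


class StorageNode:
    """Hypothetical storage node that keeps its blocks in arrival (time) order."""

    def __init__(self):
        self.blocks = []         # each block is a list of (timestamp, payload) rows
        self.block_last_ts = []  # local time index: last timestamp of each block

    def append_block(self, rows):
        # Blocks arrive in time order, so the index stays sorted.
        self.blocks.append(rows)
        self.block_last_ts.append(rows[-1][0])

    def count_in_range(self, start, end):
        # Step 1: locate the first block that can hold a timestamp >= start.
        first = bisect_left(self.block_last_ts, start)
        count = 0
        # Step 2: read consecutive blocks until one starts past the range.
        for rows in self.blocks[first:]:
            if rows[0][0] > end:
                break
            count += sum(1 for ts, _ in rows if start <= ts <= end)
        return count


def build_cluster(num_nodes=4, num_rows=1000, rows_per_block=10, seed=0):
    """Fill blocks in time order and ship each block to a randomly picked node."""
    rng = random.Random(seed)
    nodes = [StorageNode() for _ in range(num_nodes)]
    for base in range(0, num_rows, rows_per_block):
        last = min(base + rows_per_block, num_rows)
        rows = [(ts, f"row-{ts}") for ts in range(base, last)]
        rng.choice(nodes).append_block(rows)
    return nodes


def count_rows(nodes, start, end):
    """Database-node side: push the scan to every node and add the partial counts."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return sum(pool.map(lambda node: node.count_in_range(start, end), nodes))


if __name__ == "__main__":
    cluster = build_cluster()
    print(count_rows(cluster, 100, 199))
```

Each node touches only the blocks that overlap the requested range, so in this model the work per node depends on the size of the time range rather than on the total amount of stored data.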
The search ends when the data examined is outside the needed time range. This approach scales linearly by adding storage nodes to the cluster. In addition, with this approach, the data volume does not impact the retrieval time.

5. Global Cluster (hash based) Indexes

The local indexes (in section 4) support queries that evaluate data by time. However, many applications require other views of the data. For example, a query may need to view the rows of a particular device, or all the rows relating to a particular customer. These queries are supported by the Global Indexes. For a streaming table the global indexes are hash based. These indexes replace our general purpose indexes (which are used to index non-streaming tables) because they have a very small impact on the insertion time. With a standard tree based index, every insert and search requires analyzing and synchronizing multiple index blocks (root to leaf) and, in the case of an insert, updating the leaf block of the index. The hash indexes are simpler and faster and are processed with less contention between the nodes of the cluster. With the hash based indexes, inserts are done at a high rate while providing different views over the data.

6. The Query Pushdown Mechanism

The ScaleDB Pushdown Mechanism enables certain types of query processing to be done in the storage nodes. With Pushdown technology, the database nodes send query details to the storage nodes. With this information, the storage nodes can take over a large portion of the data-intensive query processing. Because the storage nodes know the query, they can search their data and send only the relevant bytes, not all the database blocks, to the database nodes. The performance benefits of this approach are as follows:

1. Parallel processing: every query is supported by multiple storage nodes. With 10 volumes, 10 machines execute the query. Each machine contributes resources (CPU, memory) to support the query and all machines work in parallel.
2. Rather than shipping a large amount of data to the database node, the query is executed next to the data and only a small amount of data is passed over the network to the database node.
3. Query performance can increase (linearly) by adding storage nodes to the cluster.

7. The Lock Manager

With a non-clustered database, lock conflicts between different processes are resolved by a lock manager. The different threads that operate over the data use a structure in a shared memory space that represents the locking status and pending requests, together with a process that resolves conflicts. We call a lock manager that operates within a shared memory space a local lock manager. However, a local lock manager is not sufficient to resolve conflicts between processes on different database nodes. As these processes do not share memory, synchronization is done by messages that are sent to and from a distributed lock manager.
As messaging is orders of magnitude slower than updating a memory structure, a special locking process was developed to overcome the performance challenges. We call this process Lock Taken. In a Lock Taken process, a node (in the clustered environment) is able to identify that there are no conflicting requests on a shared resource and issue an asynchronous lock request to the lock manager. This message informs the lock manager that a lock over a particular resource was taken and triggers a process in the distributed lock manager that updates its internal structures to represent that a particular node is holding a lock on the particular resource. The Lock Taken messages are different from Lock Request messages as they are performed asynchronously. This approach eliminates the performance penalties of the lock requests, which require a reply from the distributed lock manager.

8. Experimental Results

A ScaleDB cluster was configured with the following 9 machines:

a) Storage nodes (4 total): 4 X Intel(R) Xeon(R) CPU 2.27GHz (4 core) with 23G mem
b) Cluster Manager (1 total): 1 X Intel(R) Core(TM) i GHz (4 core) with 32G mem
c) DB nodes (4 total): 3 X Intel(R) Core(TM) i7 CPU 3.07GHz (4 core) with 5G mem; 1 X Intel(R) Xeon(R) CPU 2.13GHz (4 core) with 5G mem

Each storage node has 4 disks, 15k 600GB each, SAS 6GB/sec throughput, RAID 0, HP ProLiant DL380G
The tests used 2 tables.

A streaming table that processed massive amounts of data:

    CREATE TABLE `payment_scaledb_streaming` (
      `sale_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT `STREAMING_KEY`=Yes,
      `create_time` timestamp NOT NULL DEFAULT ' :00:00',
      `account` bigint(20) unsigned NOT NULL DEFAULT '0' `HASHKEY`=YES `HashSize`=10000,
      `store` char(16) NOT NULL DEFAULT '',
      `product` char(16) NOT NULL DEFAULT '',
      `coupon` char(7) NOT NULL DEFAULT '',
      `amount` decimal(8,2) NOT NULL,
      PRIMARY KEY (`sale_id`),
      KEY `account` (`account`),
      KEY `create_time` (`create_time`)
    ) ENGINE=ScaleDB DEFAULT CHARSET=latin1 `Streaming`=YES `RangeKey`=create_time

A regular table that maintains information that is joined with the data of the streaming table to satisfy queries:

    CREATE TABLE `stores_scaledb` (
      `name` char(25) NOT NULL DEFAULT '',
      `street` char(25) DEFAULT NULL,
      `city` char(25) DEFAULT NULL,
      `state` char(2) DEFAULT NULL,
      `zipcode` char(5) DEFAULT NULL,
      `phone` char(12) DEFAULT NULL,
      PRIMARY KEY (`name`)
    ) ENGINE=ScaleDB DEFAULT CHARSET=latin1

a) Evaluating Insert performance of a Streaming table

For insert performance we considered 3 questions:

1. What insert performance does the ScaleDB cluster provide?
2. What is the impact of data volume on the performance (would the performance remain constant as the data size increases)?
3. What is the performance factor compared to a standard insert in a conventional DBMS?
[Figure: ScaleDB - Streaming table Inserts. Rows per second over time (minutes), InnoDB vs. ScaleDB]

The Insert test shows the following:

a. On this particular cluster, the insert rate was close to 1.2 million rows per second.
b. The insert rate was constant as data volume grew (for the entire 4 hours).
c. For comparison, a single standard DBMS (we used MariaDB with InnoDB) started at about 100,000 rows per second but quickly degraded to about 5,000 inserts per second.

b) Evaluating Query performance of a Streaming table

We tested 4 different queries:

Query 1 - Count of rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows.
Query 2 - Retrieve rows within a time interval with a filter on some column values. The time interval had about 10 million rows from a table with 4 billion rows, and 3,000 rows satisfied the filter conditions.
Query 3 - Group and analyze rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows. The group by yields 10 rows.
Query 4 - Same as Query 3, with a join on each group by row. This query is only executed to show that joins are supported.

Note 1: These queries were executed while the database continued to ingest data at the same rate (1.2 million inserts/second).
Note 2: We tested a standard DBMS solution (MariaDB with InnoDB) executing the same queries.
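The paper does not list the SQL text of these queries. As an illustration of the four query shapes only, the Python sketch below runs them against a tiny in-memory SQLite database. The column names follow the table definitions above; SQLite standing in for ScaleDB, the shortened table names, the sample rows, the filter values and the function names `make_db` and `run_queries` are all assumptions made for this example.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory stand-in for the two test tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE payment (
            sale_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            create_time TEXT NOT NULL,
            account     INTEGER NOT NULL,
            store       TEXT NOT NULL,
            amount      REAL NOT NULL
        );
        CREATE TABLE stores (name TEXT PRIMARY KEY, city TEXT);
    """)
    conn.executemany(
        "INSERT INTO payment (create_time, account, store, amount) VALUES (?, ?, ?, ?)",
        [
            ("2015-01-01 10:00:00", 1, "A", 10.0),
            ("2015-01-01 10:05:00", 2, "B", 25.0),
            ("2015-01-01 10:10:00", 1, "A", 40.0),
            ("2015-01-01 11:30:00", 3, "B", 5.0),
        ],
    )
    conn.executemany(
        "INSERT INTO stores VALUES (?, ?)", [("A", "Austin"), ("B", "Boston")]
    )
    return conn


def run_queries(conn, start, end):
    """Run the four query shapes over the time interval [start, end]."""
    window = (start, end)
    # Query 1 shape: count of rows within a time interval.
    q1 = conn.execute(
        "SELECT COUNT(*) FROM payment WHERE create_time BETWEEN ? AND ?", window
    ).fetchone()[0]
    # Query 2 shape: rows in the interval, filtered on column values.
    q2 = conn.execute(
        "SELECT sale_id, amount FROM payment "
        "WHERE create_time BETWEEN ? AND ? AND store = 'A' AND amount > 20",
        window,
    ).fetchall()
    # Query 3 shape: group and aggregate rows in the interval.
    q3 = conn.execute(
        "SELECT store, COUNT(*), SUM(amount) FROM payment "
        "WHERE create_time BETWEEN ? AND ? GROUP BY store ORDER BY store",
        window,
    ).fetchall()
    # Query 4 shape: the same grouping, joined to the regular table.
    q4 = conn.execute(
        "SELECT p.store, s.city, COUNT(*), SUM(p.amount) FROM payment p "
        "JOIN stores s ON s.name = p.store "
        "WHERE p.create_time BETWEEN ? AND ? "
        "GROUP BY p.store, s.city ORDER BY p.store",
        window,
    ).fetchall()
    return q1, q2, q3, q4


if __name__ == "__main__":
    results = run_queries(make_db(), "2015-01-01 10:00:00", "2015-01-01 10:59:59")
    for name, result in zip(("Query 1", "Query 2", "Query 3", "Query 4"), results):
        print(name, result)
```

Every query restricts `create_time` to an interval, which is the access pattern the time-ordered layout and the pushdown mechanism are designed for.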
[Figure: Simple Count of 70M rows. Query time (seconds), InnoDB vs. ScaleDB]
[Figure: Filter 10M rows to 3K rows. Query time (seconds), InnoDB vs. ScaleDB]
[Figure: Group and Analyze 70M rows. Query time (seconds), InnoDB vs. ScaleDB]
[Figure: Joining Tables. Query time (seconds), InnoDB vs. ScaleDB]
9. Conclusions

ScaleDB is a 2 tier database cluster: a database tier with multiple database nodes and a storage tier with multiple storage nodes. This environment is a general purpose database platform with a distributed lock manager. The processes in the environment are tuned to support high rates of inserts (millions of rows per second) and efficient queries of time series data (hundreds of millions of rows can be evaluated per second).

In the storage tier, the data is striped across volumes and, on each volume, organized by time. Each storage node in the storage tier maintains a local index by time. This index locates a starting point for a search, from which the time series data is retrieved sequentially. Queries evaluating time series data are pushed down to be executed on multiple storage nodes, allowing each query to leverage more compute resources than a single machine offers. This approach provides a high degree of parallelism.

ScaleDB creates a complete solution to manage data. Tables supporting historical and operational data can be extended to manage time series data in a highly efficient way. Data from all tables can be joined to provide a complete and consistent view of the entire data set. Developers looking to scale their databases do not need to redesign, partition or shard their data: the system scales by adding database and storage nodes to the cluster.
More informationIn Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
More informationCisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
More informationImprove Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database
WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationConverged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities
Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling
More informationWelcome to Virtual Developer Day MySQL!
Welcome to Virtual Developer Day MySQL! Keynote: Developer and DBA Guide to What s New in MySQL Andrew Morgan - MySQL Product Management @andrewmorgan www.clusterdb.com 1 Program Agenda 1:00 PM Keynote:
More informationConfiguring Apache Derby for Performance and Durability Olav Sandstå
Configuring Apache Derby for Performance and Durability Olav Sandstå Database Technology Group Sun Microsystems Trondheim, Norway Overview Background > Transactions, Failure Classes, Derby Architecture