ScaleDB Managing Streams of Time Series Data

Using ScaleDB to Manage Streams of Time Series Data

Abstract

Every day, huge volumes of sensor, transactional, and web data are continuously generated as streams that need to be analyzed. Streaming data is one of the main sources of Big Data. However, mining Big Data streams faces two principal challenges: Velocity, the frequency at which data is generated or delivered, and Volume, the amount of data accumulated.

Because many developers found that traditional databases lacked support for Big Data, specialized solutions were introduced. Currently, most Big Data projects use Hadoop. However, Hadoop is a batch processing system that mainly addresses the Volume challenge. To address the Velocity challenge, different architectural solutions are built around Hadoop; these are referred to as Lambda architectures. Examples are Storm, Druid and Cassandra, which are set up to process incoming data before the data is processed in Hadoop.

For time series data we propose a different approach. ScaleDB keeps the data in the database, making the solution SQL based, simple to develop and deploy, and cost effective to operate and maintain. Scaling is done as in Hadoop, by adding nodes to a cluster. Unlike Hadoop, however, insertion is done in near real time, supporting a Velocity that can exceed millions of inserts per second. In particular, performance remains constant as data volume grows. ScaleDB is disk based, offering TCO advantages over in-memory solutions. ScaleDB leverages the entire MySQL ecosystem of tools, people and processes, avoiding the extremely complex Lambda deployment. Lastly, ScaleDB is row based, providing a unified and integrated solution to manage streaming, operational and historical data.

ScaleDB is a platform of database and storage nodes connected by a network. These connected machines form a cluster that processes data in 2 tiers: a database tier and a storage tier.
The storage tier provides a shared data container for the database nodes in the database tier. The overall cluster is a general purpose, shared disk, row based database for Big Data. It includes special optimizations to provide high insertion rates while supporting BI queries that evaluate data by time. The BI queries are pushed to the storage nodes so that they are satisfied with a high degree of parallelism. Our experience and performance studies on standard commodity hardware show over a million inserts per second while concurrent BI queries evaluate hundreds of millions of rows per second, with near linear scalability achieved by adding database nodes and storage nodes to the cluster.

1. Introduction

Queries that evaluate data by time specify a time interval in which a property or a behavior needs to be determined and monitored. These types of queries are frequently used to support a variety of business requirements. Examples include determining sales trends, fraud detection, monitoring users' behavior based on click streams, and machine learning. A new and growing market is the Internet of Things, where devices generate enormous amounts of data that need to be evaluated over time.

Applications that need to analyze data to satisfy these types of queries are challenged by huge amounts of data, high velocity data and requirements for real time analysis. In addition, many applications require not only the summary information but also the source data. For example, in a security application, the same data set that raises an alert on a potential security breach by analyzing the number of entries to a particular IP and port may be required to present the details of the security event. To trace the root cause, a troubleshooter may need to know all the source and destination IPs and ports interacted with over a given time. ScaleDB processes are based on source data and therefore can satisfy queries on individual rows.
ScaleDB uses MySQL/MariaDB as its interface, making it simple to deploy and manage and fully compatible with the entire MySQL ecosystem. ScaleDB provides the best price performance solution for processing and analyzing time-series data.

Table of Contents:
Section 2 The ScaleDB System Overview
Section 3 Data Organization
Section 4 Local Time Based Indexes
Section 5 Global Hash Indexes
Section 6 Query Push Down Mechanism
Section 7 Lock Manager
Section 8 Experimental Results
Section 9 Conclusions

ScaleDB Managing Streams of Time Series Data Copyright 2015 by ScaleDB

2. The ScaleDB System Overview

ScaleDB provides a unified approach to manage data streams, historical data and operational data. Tables are defined and data is processed using SQL. Scaling is achieved by adding nodes to a cluster without the need to shard or partition the data. As will be explained and demonstrated, insert and query performance scales linearly as nodes are added to the cluster, and data volumes have no impact on performance.

ScaleDB is a disk based solution: the amount of memory (RAM) required is orders of magnitude smaller than the total size of the data managed. This approach provides a much better TCO than alternative memory based DBMS solutions. This document will show how ScaleDB manages time series data very efficiently by overcoming the insertion and row retrieval performance barriers of standard disk-based solutions.

ScaleDB is a relational, transactional, shared disk, clustered database using 3 major software components:

a) Storage Nodes - These nodes maintain the data. The data is organized in rows which are contained in disk based blocks. Frequently used blocks are maintained in a cache layer and less frequently used blocks are read from disk. With multiple storage nodes, the data is striped over the different nodes such that every node manages a portion of the data.

b) Database Nodes - These nodes receive and process low level logical requests to manipulate the data, for example a request to insert a new row or a request to retrieve a row by a particular key. This functionality is triggered by a call to a ScaleDB native API. Each database node is able to read and write the entire data set (read from and write to each storage node).
When a database node processes data, it reads the data into its local cache from the storage node that maintains the data, and when a transaction commits, the database node transfers the updated data to the storage tier. ScaleDB does not provide a SQL parser but offers an interface to MariaDB: MariaDB treats ScaleDB as a native storage engine and is not aware of the cluster. The cluster level processes are managed by ScaleDB.

c) Lock Manager - As all the data is available to all the database nodes in the cluster, the lock manager provides a locking mechanism that synchronizes the lock requests that different database nodes make on shared resources. The logic and performance considerations for the lock manager are explained in section 7 below.

These nodes are connected by a network and provide a 2 tier platform:

1. The database tier is a collection of database nodes with MariaDB as the user interface and ScaleDB as the database engine.
2. The storage tier is a collection of storage nodes. Each node is configured and tuned to satisfy IO requests from the database nodes.

The storage tier presents a complete and consistent view of the entire data set to each of the database nodes in the cluster, and each database node in turn presents a complete and consistent view of the data to the application.
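The read-into-cache and write-on-commit flow described above can be illustrated with a minimal Python sketch. It is not ScaleDB code: the class names `StorageTier` and `DatabaseNode`, the block ids, and the row layout are hypothetical, and the sketch leaves out locking and cache invalidation, which in the real system involve the lock manager.

```python
class StorageTier:
    """Stand-in for the storage tier: maps block ids to block contents."""

    def __init__(self):
        self.blocks = {}

    def read(self, block_id):
        # Return a copy so that a node's cache is private until commit.
        return dict(self.blocks.get(block_id, {}))

    def write(self, block_id, rows):
        self.blocks[block_id] = dict(rows)


class DatabaseNode:
    """A database node with a local cache over the shared storage tier."""

    def __init__(self, storage):
        self.storage = storage
        self.cache = {}      # block id -> rows held in the local cache
        self.dirty = set()   # block ids updated by the open transaction

    def _load(self, block_id):
        # Read the block from the storage tier on first use.
        if block_id not in self.cache:
            self.cache[block_id] = self.storage.read(block_id)
        return self.cache[block_id]

    def insert(self, block_id, key, row):
        self._load(block_id)[key] = row
        self.dirty.add(block_id)

    def get(self, block_id, key):
        return self._load(block_id).get(key)

    def commit(self):
        # On commit, transfer every updated block to the storage tier.
        for block_id in sorted(self.dirty):
            self.storage.write(block_id, self.cache[block_id])
        self.dirty.clear()


storage = StorageTier()
node_a, node_b = DatabaseNode(storage), DatabaseNode(storage)

node_a.insert(block_id=1, key=100, row={"amount": 9.5})
before_commit = storage.read(1)          # nothing has reached storage yet
node_a.commit()
after_commit = node_b.get(block_id=1, key=100)
print(before_commit, after_commit)
```

Because both nodes share the same storage tier, the second node sees the row once the first node has committed, without any data being assigned to a particular database node.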

[Figure: A ScaleDB Cluster, showing DBMS nodes, a Cluster Manager, and Storage nodes.]

HA of the database tier - since data is shared among all the database nodes, if a database node fails, queries can be routed to any of the surviving nodes.

HA of the distributed lock manager is achieved by maintaining a standby lock manager that manages the cluster if the active lock manager fails.

To explain how HA is maintained in the storage tier we define the concept of a volume. A volume is a logical container that maintains a portion of the data. For example, with 10 volumes the data is striped over the 10 volumes such that every volume has one tenth of the data. In practice, a volume can be supported by one storage node. However, to provide HA, a volume is supported by 2 or more nodes, and when data is sent to be written in a volume, it is written to all the nodes that support the volume. Therefore, if a storage node fails, the data in a particular volume remains available from a different node in the volume and the system continues to run.

For regular tables, ScaleDB uses general purpose indexes which are based on unique, balanced, Patricia Trie structures. These are detailed elsewhere. These indexes are global indexes across the cluster, which means that they are treated as shared resources by the different database nodes in the cluster.

In this paper we detail a new approach to manage Big Data in a clustered database with special optimizations for time series data. These optimizations allow a high insertion rate, efficient queries of rows by key values, and efficient queries that filter and group data. Our method defines a special table in the database. The table is called a Streaming Table and it has the following characteristics:

1. The primary key is a unique number which is automatically generated.
2. One of the columns in the table is a time stamp which is automatically generated and represents the insert time of each row in the database.
3. The data of the table is striped over multiple storage nodes in the storage tier.
4. Each storage node maintains a local index over the data by the time stamps.
5. In every storage node, the data is sequentially organized by time.
6. The global indexes are based on hash structures.
7. Queries that evaluate data within a time interval leverage a pushdown mechanism.

3. Data Organization

ScaleDB operates multiple database instances on different servers in the cluster against a shared set of data files. Each data file spans multiple hardware systems and yet appears as a single unified dataset to the database node. This enables the utilization of commodity hardware to reduce total cost of ownership and to provide a scalable computing environment that supports massive data volumes.

Each database instance considers the data as organized in tables, where each table organizes the data in uniform rows. Logically, tables appear to the database instances as a continuous set of rows. However, the physical implementation is different, in order to support performance, scalability and HA.

ScaleDB places the data in volumes. A volume is a logical unit that maintains data. With N volumes, each volume manages 1/N of the data. A volume is supported by one or more storage nodes, which are commodity machines (or virtual machines). When data is written, it is written to all the nodes assigned to the volume. If a volume is supported by 2 nodes, the data is written twice and therefore the system continues to run if a storage node fails.

When a row is written, it is placed in a disk based block, and each block is shipped to one volume in the storage tier. As volumes are randomly picked, ScaleDB achieves an even distribution of the data among the volumes and avoids hotspots. In each storage node, the blocks are organized sequentially in one or more files, and these files are dynamically mapped to the logical tables that are considered and processed by the database nodes. Similarly, a table may be supported by indexes that are organized in disk based blocks and are randomly placed in the different volumes.

The outcome of this approach is that the data is striped across multiple volumes and the IO operations are efficient, as they leverage all the resources (CPU, memory, disks) of all the nodes in the storage tier. Scaling of the storage tier is simple, as it only requires additional nodes.

However, for high volume and velocity with disk based data, a traditional indexing approach may still be insufficient. If we expect hundreds of thousands or more inserts per second, a disk based solution that requires random IOs will not be able to support the insert rate efficiently.
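The volume placement described earlier in this section (a randomly picked volume per block, with every write going to all nodes of that volume) can be sketched in Python as follows. This is an illustration under assumptions, not ScaleDB's implementation: the names `Volume` and `place_blocks`, the use of 4 volumes with 2 nodes each, and the in-memory dictionaries standing in for storage nodes are all hypothetical.

```python
import random
from collections import Counter


class Volume:
    """A logical container backed by one or more storage nodes (replicas)."""

    def __init__(self, name, replica_count=2):
        self.name = name
        # Each storage node is modelled as a dict: block id -> block contents.
        self.replicas = [dict() for _ in range(replica_count)]
        self.failed = set()   # indexes of storage nodes that are down

    def write(self, block_id, block):
        # A write goes to every node that supports the volume.
        for replica in self.replicas:
            replica[block_id] = block

    def read(self, block_id):
        # Any surviving node of the volume can serve the block.
        for index, replica in enumerate(self.replicas):
            if index not in self.failed:
                return replica[block_id]
        raise RuntimeError(f"no surviving node in volume {self.name}")


def place_blocks(block_ids, volumes, rng):
    """Ship each block to a randomly picked volume; return block -> volume."""
    placement = {}
    for block_id in block_ids:
        volume = rng.choice(volumes)
        volume.write(block_id, f"rows of block {block_id}")
        placement[block_id] = volume
    return placement


rng = random.Random(7)   # fixed seed so the run is repeatable
volumes = [Volume(f"vol{i}") for i in range(4)]
placement = place_blocks(range(10_000), volumes, rng)

# Random placement gives each volume roughly 1/N of the blocks.
blocks_per_volume = Counter(v.name for v in placement.values())
print(blocks_per_volume)

# Simulate losing one storage node of the volume that holds block 42.
placement[42].failed.add(0)
print(placement[42].read(42))
```

With 10,000 blocks over 4 volumes, each volume ends up with close to 2,500 blocks, and the block remains readable after one of the two nodes of its volume is marked as failed.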
For queries, if rows are retrieved by an index from disk, even a single random IO per row would make the query slow when millions of rows are retrieved (1,000,000 reads at 4 ms per IO take about 4,000 seconds, which is more than an hour).

ScaleDB uses 2 techniques to support huge insertion rates and queries that evaluate large data sets. The first is a special organization of the data and the second is leveraging a high degree of parallelism.

In order to avoid random seeks, time series data is organized by time such that queries by time read data sequentially. This organization allows for efficient retrieval of rows in a given time range. If the first block of a time range is available (see section 4 below), then all the rows of the time range can be efficiently retrieved by sequentially reading consecutive blocks. Therefore, when rows are retrieved by time, a sequential scan replaces the random seek of a conventional index.

As the data of a table is evenly distributed over all the volumes, the scan (and sometimes the query, see section 6) is pushed to all the volumes such that multiple machines concurrently retrieve the data. This approach provides a high degree of parallelism. The degree of parallelism can be increased by adding volumes to the cluster. The additional nodes provide additional resources and increase parallelism, so that query performance for time series data can be scaled to meet a given performance requirement.

4. Local Storage Indexes

When a block is written to a storage node, the block is indexed such that it is possible to locate the block by the time stamps represented in the time column of the rows contained in the block. A search for data by time is processed by each storage node independently and aggregated by the database node that initiated the query (see section 6). The process in each storage node is done in 2 steps: locating the first block with the needed time stamp and retrieving the rest of the data sequentially.
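The 2-step search can be sketched in Python as follows. This is a simplified illustration, not ScaleDB's index format: it assumes rows arrive in non-decreasing time order, models the local index as a sorted list of each block's first time stamp, and distributes rows to nodes round-robin for repeatability. The names `StorageNode`, `scan` and `query_by_time` are hypothetical, and the storage nodes are visited one after another here rather than in parallel.

```python
from bisect import bisect_left


class StorageNode:
    """Holds blocks in time order plus a local index of block start times."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = []        # each block is a list of (timestamp, value)
        self.first_times = []   # local index: first timestamp of each block

    def append(self, timestamp, value):
        # Rows arrive in time order, so a new row goes to the last block.
        if not self.blocks or len(self.blocks[-1]) == self.block_size:
            self.blocks.append([])
            self.first_times.append(timestamp)
        self.blocks[-1].append((timestamp, value))

    def scan(self, start, end):
        """Return rows with start <= timestamp < end."""
        # Step 1: locate the first block that may contain `start`. It is the
        # block just before the first block that begins at or after `start`.
        first = max(bisect_left(self.first_times, start) - 1, 0)
        rows = []
        # Step 2: read consecutive blocks until one begins past the range.
        for block in self.blocks[first:]:
            if block[0][0] >= end:
                break
            rows.extend(row for row in block if start <= row[0] < end)
        return rows


def query_by_time(nodes, start, end):
    """Database node side: scan every storage node and merge the results."""
    rows = []
    for node in nodes:
        rows.extend(node.scan(start, end))
    return sorted(rows)


nodes = [StorageNode() for _ in range(3)]
for t in range(100):
    nodes[t % 3].append(t, f"row-{t}")

result = query_by_time(nodes, 40, 50)
print([t for t, _ in result])   # [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
```

Each node touches only the blocks around the requested interval, so the work per node depends on the size of the interval rather than on the total number of rows stored.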
The search ends when the data examined is outside the needed time range. This approach scales linearly by adding storage nodes to the cluster. In addition, with this approach, the data volume does not impact the retrieval time.

5. Global Cluster (hash based) Indexes

The local indexes (in section 4) support queries that evaluate data by time. However, many applications require other views of the data. For example, a query may need to view the rows of a particular device, or all the rows relating to a particular customer. These queries are supported by the Global Indexes. For a streaming table the global indexes are hash based. These indexes replace our general purpose indexes (which are used to index a non-streaming table) as they have a very small impact on the insertion time. With a standard tree based index, every insert and search requires analyzing and synchronizing multiple index blocks (root to leaf) and, in the case of an insert, updating the leaf block of the index. The hash indexes are simpler and faster and are processed with less contention between the nodes of the cluster. With the hash based indexes, inserts are done at a high rate while providing different views over the data.

6. The Query Pushdown Mechanism

The ScaleDB Pushdown Mechanism enables certain types of query processing to be done in the storage nodes. With Pushdown technology, the database nodes send query details to the storage nodes. With this information, the storage nodes can take over a large portion of the data-intensive query processing. The Pushdown Mechanism can search storage nodes with added intelligence about the query and send only the relevant bytes, not all the database blocks, to the database nodes. The performance benefits of this approach are as follows:

1. Parallel processing: every query is supported by multiple storage nodes. With 10 volumes, 10 machines execute the query. Each machine contributes resources (CPU, memory) to support the query and all machines work in parallel.
2. Rather than shipping a large amount of data to the database node, the query is executed next to the data and only a small amount of data is passed over the network to the database node.
3. Query performance can increase (linearly) by adding storage nodes to the cluster.

7. The Lock Manager

With a non-clustered database, lock conflicts between different processes are resolved by a lock manager. The different threads that operate over the data use a structure in a shared memory space that represents the locking status and the pending requests, together with a process that resolves conflicts. We call a lock manager that operates within a shared memory space a local lock manager. However, a local lock manager is not sufficient to resolve conflicts between processes on different database nodes. As these processes do not share memory, synchronization is done by messages that are sent to and from a distributed lock manager.
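A local lock manager of the kind described above can be sketched as an in-memory lock table. The following Python sketch is a generic shared/exclusive lock table, not ScaleDB's implementation: the class name `LocalLockManager`, the "S" and "X" lock modes, and the first-come-first-served queue are assumptions made for illustration, and the mutex that would protect the table between threads is omitted.

```python
from collections import defaultdict, deque


class LocalLockManager:
    """In-memory lock table: current holders plus a queue of pending requests."""

    def __init__(self):
        self.holders = defaultdict(dict)    # resource -> {owner: mode}
        self.pending = defaultdict(deque)   # resource -> queue of (owner, mode)

    def _compatible(self, resource, owner, mode):
        others = [m for o, m in self.holders[resource].items() if o != owner]
        # Shared locks coexist; an exclusive lock needs the resource alone.
        return not others or (mode == "S" and all(m == "S" for m in others))

    def request(self, resource, owner, mode):
        """Grant the lock now (True) or queue the request (False)."""
        if not self.pending[resource] and self._compatible(resource, owner, mode):
            self.holders[resource][owner] = mode
            return True
        self.pending[resource].append((owner, mode))
        return False

    def release(self, resource, owner):
        """Drop a lock, then grant queued requests in arrival order."""
        self.holders[resource].pop(owner, None)
        granted = []
        queue = self.pending[resource]
        while queue and self._compatible(resource, *queue[0]):
            waiter, mode = queue.popleft()
            self.holders[resource][waiter] = mode
            granted.append(waiter)
        return granted


locks = LocalLockManager()
r1 = locks.request("block-7", "t1", "S")   # granted
r2 = locks.request("block-7", "t2", "S")   # granted, shared with t1
r3 = locks.request("block-7", "t3", "X")   # queued behind the readers
g1 = locks.release("block-7", "t1")        # t2 still holds a shared lock
g2 = locks.release("block-7", "t2")        # t3 now gets the exclusive lock
print(r1, r2, r3, g1, g2)                  # True True False [] ['t3']
```

Every operation here is a plain update of a memory structure, which is why a lock manager of this kind is fast within one machine and why it cannot be used directly between database nodes that do not share memory.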
As messaging is orders of magnitude slower than updating a memory structure, a special locking process was developed to overcome the performance challenges. We call this process Lock Taken. In a Lock Taken process, a node in the clustered environment is able to identify that there are no conflicting requests on a shared resource and sends an asynchronous Lock Taken message to the lock manager. This message informs the lock manager that a lock over a particular resource was taken, and triggers a process in the distributed lock manager that updates its internal structures to represent that a particular node is holding a lock on the particular resource. Lock Taken messages differ from Lock Request messages in that they are performed asynchronously. This approach eliminates the performance penalty of lock requests, which require a reply from the distributed lock manager.

8. Experimental Results

A ScaleDB cluster was configured with the following 9 machines:

a) Storage nodes (4 total): 4 X Intel(R) Xeon(R) CPU 2.27GHz (4 core) with 23G mem
b) Cluster Manager (1 total): 1 X Intel(R) Core(TM) i GHz (4 core) with 32G mem
c) DB nodes (4 total): 3 X Intel(R) Core(TM) i7 CPU 3.07GHz (4 core) with 5G mem, and 1 X Intel(R) Xeon(R) CPU 2.13GHz (4 core) with 5G mem

Each storage node has 4 disks, 15k 600GB each, SAS 6GB/sec throughput, RAID 0, HP ProLiant DL380G

The tests used 2 tables:

A streaming table that processed massive amounts of data.

    CREATE TABLE `payment_scaledb_streaming` (
      `sale_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT `STREAMING_KEY`=Yes,
      `create_time` timestamp NOT NULL DEFAULT ' :00:00',
      `account` bigint(20) unsigned NOT NULL DEFAULT '0' `HASHKEY`=YES `HashSize`=10000,
      `store` char(16) NOT NULL DEFAULT '',
      `product` char(16) NOT NULL DEFAULT '',
      `coupon` char(7) NOT NULL DEFAULT '',
      `amount` decimal(8,2) NOT NULL,
      PRIMARY KEY (`sale_id`),
      KEY `account` (`account`),
      KEY `create_time` (`create_time`)
    ) ENGINE=ScaleDB DEFAULT CHARSET=latin1 `Streaming`=YES `RangeKey`=create_time

A regular table that maintains information that is joined with the data of the streaming table to satisfy queries.

    CREATE TABLE `stores_scaledb` (
      `name` char(25) NOT NULL DEFAULT '',
      `street` char(25) DEFAULT NULL,
      `city` char(25) DEFAULT NULL,
      `state` char(2) DEFAULT NULL,
      `zipcode` char(5) DEFAULT NULL,
      `phone` char(12) DEFAULT NULL,
      PRIMARY KEY (`name`)
    ) ENGINE=ScaleDB DEFAULT CHARSET=latin1

a) Evaluating Insert performance of a Streaming table

For insert performance we considered 3 questions:

1. What insert performance does the ScaleDB cluster provide?
2. What is the impact of data volume on performance (does the performance remain constant as the data size increases)?
3. How does the performance compare with a standard insert into a conventional DBMS?

[Figure: ScaleDB - Streaming table Inserts. Rows per second over time (minutes), InnoDB versus ScaleDB.]

The Insert test shows the following:

a. On this particular cluster, the insert rate was close to 1.2 million rows per second.
b. The insert rate was constant as data volume grew (for the entire 4 hours).
c. By comparison, a single standard DBMS (we used MariaDB with InnoDB) started at about 100,000 rows per second but quickly degraded to about 5,000 inserts per second.

b) Evaluating Query performance of a Streaming table

We tested 4 different queries:

Query 1 - Count of rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows.
Query 2 - Retrieve rows within a time interval with a filter on some column values. The time interval had about 10 million rows from a table with 4 billion rows, and 3,000 rows satisfied the filter conditions.
Query 3 - Group and analyze rows within a time interval. The time interval had about 70 million rows from a table with 4 billion rows. The group by yields 10 rows.
Query 4 - Same as query 3 with a join on each group by row. This query is only executed to show that joins are supported.

Note 1: These queries were executed while the database continued to ingest data at the same rate (1.2 million inserts/second).
Note 2: We tested a standard DBMS solution (MariaDB and InnoDB) executing the same queries.
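The paper does not list the SQL text of the four queries. To show their general shape, the sketch below runs four hypothetical queries of the same kinds against a tiny in-memory SQLite database from Python. Everything in it is an assumption made for illustration: SQLite stands in for MariaDB with ScaleDB, the tables are simplified versions of the two test tables, `create_time` is an integer number of seconds, the join column is assumed to be the store name, and the 100 generated rows bear no relation to the measured data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (
        sale_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        create_time INTEGER NOT NULL,   -- seconds since the start of the test
        account     INTEGER NOT NULL,
        store       TEXT NOT NULL,
        product     TEXT NOT NULL,
        amount      REAL NOT NULL
    );
    CREATE INDEX payment_time ON payment (create_time);
    CREATE TABLE stores (name TEXT PRIMARY KEY, city TEXT);
""")
conn.executemany("INSERT INTO stores VALUES (?, ?)",
                 [("north", "Oslo"), ("south", "Rome")])
rows = [(t, t % 5, "north" if t % 2 else "south", f"p{t % 3}", 1.0 + t % 4)
        for t in range(100)]
conn.executemany(
    "INSERT INTO payment (create_time, account, store, product, amount) "
    "VALUES (?, ?, ?, ?, ?)", rows)

window = (20, 60)   # half-open interval: 20 <= create_time < 60

# Query 1: count of rows within a time interval.
q1 = conn.execute(
    "SELECT COUNT(*) FROM payment WHERE create_time >= ? AND create_time < ?",
    window).fetchone()[0]

# Query 2: rows within the interval, filtered on some column values.
q2 = conn.execute(
    "SELECT sale_id, amount FROM payment "
    "WHERE create_time >= ? AND create_time < ? "
    "AND store = 'north' AND product = 'p0'", window).fetchall()

# Query 3: group and aggregate rows within the interval.
q3 = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM payment "
    "WHERE create_time >= ? AND create_time < ? "
    "GROUP BY store ORDER BY store", window).fetchall()

# Query 4: the same grouping, joined with the regular stores table.
q4 = conn.execute(
    "SELECT s.city, COUNT(*), SUM(p.amount) FROM payment p "
    "JOIN stores s ON s.name = p.store "
    "WHERE p.create_time >= ? AND p.create_time < ? "
    "GROUP BY s.city ORDER BY s.city", window).fetchall()

print(q1, len(q2), q3, q4)
```

All four queries restrict `create_time` to an interval, which is the property that lets a time-organized table answer them with a sequential scan of the blocks covering that interval.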

[Figures: Query time (seconds), InnoDB versus ScaleDB, for "Simple Count of 70M rows" and "Filter 10M rows to 3K rows".]

9. Conclusions

ScaleDB is a 2 tier database cluster: a database tier with multiple database nodes and a storage tier with multiple storage nodes. This environment is a general purpose database platform with a distributed lock manager. The processes in the environment are tuned to support high rates of inserts (millions of rows per second) and efficient queries of time series data (hundreds of millions of rows can be evaluated per second).

In the storage tier, the data is striped into volumes and, on each volume, organized by time. Each storage node in the storage tier maintains a local index by time. This index locates a starting point for a search, from which the time series data is retrieved sequentially. Queries evaluating time series data are pushed down to be executed on multiple storage nodes, allowing each query to leverage more compute resources than a single machine offers. This approach offers a huge amount of parallelism.

ScaleDB creates a complete solution to manage data. Tables supporting historical and operational data can be extended to manage time series data in a highly efficient way. Data from all tables can be joined to provide a complete and consistent view of the entire data set. Developers looking to scale their databases do not need to redesign or shard their data: this approach scales by adding database and storage nodes to a cluster, without the need to partition or shard the data.