Real-time Ad-hoc Analytics on S3 with MemSQL

Satish Cattamanchi, 4INFO
Sarvesh Gupta, Tavant Technologies
September 2015

ABSTRACT

Enterprises are witnessing a rapid increase in data volume as business volumes grow. At the same time, there is rising demand for immediate results and readily available metrics, delivered cost-effectively, to support strategic decisions. This paper shares key insights from our experiment with the MemSQL in-memory database to solve the challenge of querying huge data volumes from Amazon S3 in real time, enabling interactive ad-hoc queries. The paper outlines the details of the experiment along with benchmark details: the size of the data set, cluster configuration, types of queries, and the tuning methods used to reduce response time. The initial benchmark results indicate tremendous potential in MemSQL as a scalable real-time database engine with high data ingestion speeds, accessible through a familiar SQL interface. Towards the end of the paper, the roadmap section highlights further opportunities with MemSQL.

Keywords: real-time analytics, in-memory database, big data, ad-hoc query

INTRODUCTION

The advent of open-source Hadoop, with its ability to scale to massive amounts of data, has democratized big data and analytic capabilities for all companies to exploit. The possibility of processing such huge volumes of data has given rise to newer use cases, one of them being real-time processing to unlock the true business potential of relevant businesses. Being a data-driven company, we play a big role in enabling our teams to gain insights from a multi-terabyte data warehouse (DW). Our use cases range from analyzing campaign performance to analyzing audience behavior and applying statistical models for proprietary insights. The underpinning of our current technical architecture is that we leverage the Amazon S3 web service for our DW, which allows us to isolate our storage and compute layers.
Our team writes Hive queries or Python scripts to perform ETL and analytics, the output of which is transferred to MySQL for reporting. Beyond this, we also need to analyze our broader data set on S3, consisting of billions of requests and events generated every day, and to equip the team with an infrastructure that enables interactive querying and real-time analytics while avoiding the delays of a batch processing pipeline running on Hadoop. The time to get results is critical to our business needs. We came across the MemSQL in-memory database, which looked promising, with strong industry backing and some big implementations. Together with the technology and consulting company Tavant Technologies, we evaluated MemSQL as a viable option for this use case.

CHALLENGE

With tremendous growth in ad traffic, our server log volume skyrocketed within a few months. To keep innovating, the team needed to analyze high volumes of data consisting of billions of data points in real time to produce insights and do strategic planning. Although a robust data pipeline existed using the Amazon EMR service and Spark, the sheer volume and velocity of data made it too slow for the team to take full advantage of it. We faced multiple challenges in collecting and processing billions of records with real-time interactive response. We strongly believe in the importance of building a data-driven culture that gives as many people as possible within the company broad access to data for discovery and analysis. However, that requires a widely learned and accepted form of data access. Traditionally, the use of SQL to connect with BI tools or for interactive analysis with dashboards has made SQL the common thread that binds all.
Since a significant portion of our data contains some form of structure that can be accessed via the SQL paradigm, we wanted to find a solution with a SQL-like interface and real-time capabilities to meet our needs.

REQUIREMENTS

In the existing setup, we run a technical stack powered by the Amazon EMR service. Processing billions of records is slow: the response time to generate reports is anywhere between 5 and 15 hours. The current infrastructure is unable to meet the performance and scalability requirements at 4INFO. The objective behind implementing this technical stack is to ensure that we can get results in near real time. This will allow our ad ops/research team to take quick actions for tuning campaign performance, as well as provide early insight into the relevant metrics for our partners. While doing this exercise, we wanted to cover some frequent simple as well as complex scenarios in order to validate the usefulness of the solution. With this aim, we selected different use cases which we ran on our existing setup to gather performance numbers for the new solution. As per Figure 1, we have the infrastructure to support our data mining needs; however, we also want to fulfill the needs of interactive queries and data analyst queries.

Figure 1: Expected response times per use case

With the rise in popularity of SQL-on-Hadoop solutions like Impala, Presto, and HAWQ, choosing one of them could have been one path to travel. Ideally, though, we wanted an open-source interactive query engine that can query the S3 warehouse without the need to spin up a Hadoop cluster. After narrowing down our objective, we brainstormed the key factors that would drive our decision to adopt a solution, and came up with the following list of requirements to zero in on a technology:

Low latency/real-time: This factor is critical to the whole objective of having a platform that can answer questions without waiting hours or days. Low latency not only improves productivity but also removes the apprehension that discourages asking questions.

Ad-hoc queries: The goal of setting up this pipeline is to enable the data team to perform ad-hoc queries, which implies the team should be able to connect and fire queries for any use case at the desired level of granularity. Thus, it is ideal to stay close to the Structured Query Language (SQL) paradigm, as the users of our system do not belong to the data science world but are ones who leverage decades of knowledge and experience in SQL-style querying, generating enormous business value.

Scalable: Our business volume has grown significantly in the recent past and is expected to grow manifold in the near future. Hence, it is important for us to design the system to be robust and scalable, so that we can sustain even higher data workloads without re-engineering the platform.

Ease of use: As with ad-hoc queries, ease of use is important to ensure that implementation and maintenance are hassle-free. This covers installation and scaling of the servers, proper documentation and support, and intuitive tools to connect to and query the data.

TECHNOLOGIES VISITED

To solve this challenge, we explored multiple options. A few of the key initial attempts were with the following technologies:

Druid [4]: Druid is an open-source analytics data store designed for real-time exploratory analytics on large data sets. Druid provides low-latency (real-time) data ingestion, flexible data exploration, and fast data aggregation.

MemSQL [1]: MemSQL combines transactions and analytics for sub-second data processing and reporting. Enterprises can easily build real-time applications to instantly respond to changes in their business, make better decisions, and deliver precision insights with real-time reporting.

Pivotal GemFire [5]: Pivotal GemFire is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.

Apache Drill [6]: Drill is an Apache open-source SQL query engine for big data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern big data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language.

We did a deeper dive with Apache Drill, installing it on our infrastructure, as we were seeing a lot of buzz around it. Drill includes a distributed execution environment, purpose-built for large-scale data processing. Setting up Drill is quite straightforward: we spun up a cluster with 24 Drillbit nodes and 1 ZooKeeper instance. Getting started with querying is really simple; we just created a storage profile pointing to S3 and started querying. Ingestion is intuitive, as you do not need to create tables or schemas beforehand: you can use the profile and fire queries from the command line. In our exploratory attempt with Apache Drill, however, we were not able to achieve good performance. Standard single-table queries against our request and segment tables took really long (hours) at the data sizes mentioned above. We contacted the Apache Drill community as well, and it seems Drill's support for querying S3 is still evolving. An alternative option is to run Apache Drill over a Hadoop cluster; however, we wanted to skip a Hadoop cluster. We will still be evaluating it on a parallel track and trying to get better performance. Apart from the above technologies, we also had Amazon Redshift and Presto on the radar, but we were not much inclined towards them for reasons such as dependency on the Amazon stack and the recommended use of the Parquet file format for significant performance improvements.

Why MemSQL?

After these initial attempts to look closely at the above set of technologies, we shortlisted MemSQL.
There were multiple reasons behind this:

- The architecture and setup are quite intuitive
- The query syntax is ANSI SQL
- Strong industry backing and success stories (Comcast, Pinterest, and many more)
- Being wire-compliant with MySQL, as they put it, connecting to your data is seamless
- It blends with our data structure (gzipped flat files) and does not need transformation for data ingestion
- It allows reading directly from the file system
- It can be deployed as an in-house solution

BENCHMARKING

Every user's specific applications or workloads are unique, so no performance benchmark can be termed perfect. Therefore, in this performance benchmark we chose to limit variables and focus on a straightforward comparative: real-time analytics on S3. We used open-source and community editions of software with data loaded in standard file formats, while striving to keep other factors constant.

MEMSQL OVERVIEW

MemSQL has a two-tiered, clustered architecture. Each instance of the MemSQL program is called a node, and all nodes run identical software; the only difference is the role the nodes are configured to play. Aggregator nodes provide a single interface to database clients and applications: aggregators broker SQL queries to the cluster and aggregate results. Leaf nodes store and process data. Communication between leaves and aggregators is all over standard SQL.

Figure 2: MemSQL Architecture [1]

SETUP

To set up MemSQL, we used the MemSQL CloudFormation template to launch a cluster with the MemSQL community edition. MemSQL can also be set up from the tar file and gotten up and running quickly, and it is quite simple to expand and shrink the cluster using the web UI. However, the CloudFormation template is recommended by the MemSQL team, as it configures a cluster following their best-practice recommendations; the cluster generator tool automatically provisions and configures a MemSQL cluster using Amazon CloudFormation.
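Once a cluster is up, its two-tier topology can be inspected from an aggregator with standard MemSQL management commands. A minimal sketch (the exact output columns vary by version):

```sql
-- List the master and child aggregator nodes known to the cluster
SHOW AGGREGATORS;

-- List the leaf nodes along with their host, port, and state
SHOW LEAVES;
```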
BENCHMARK #1

DATA SET

The data set comprised three types of ad-server logs: request, event, and metadata logs. We picked a canned date range to select the test data set. The details of the data set are as follows:

  S.No.  File Type     Number of records  Compressed size (in TB)  Number of fields
  1      Request Log   4,641,205,
  2      Event Log     126,201,
  3      Metadata Log  3,245,504,

After loading the data, all queries were run manually, simulating single-user query concurrency with no other activity. Performance was based on the average of 5 runs through the interactive command line, without any third-party access tools.

CLUSTER

  Option                         Value
  Number of Master Aggregators   1
  Number of Child Aggregators    8
  Number of Leaf nodes           20
  EBS Volume Size (in gigabytes) 2x80 (SSD)
  EC2 Instance Type              m3.2xlarge

INGESTING DATA

MemSQL supports two types of stores: a completely in-memory rowstore and a disk-backed columnstore. Since our data volume was huge, storing all of it in memory would be an expensive proposition. Therefore, we chose to store the bigger logs (request and metadata) in columnstore tables, whereas for the event log we created a rowstore table. It is important to select the right keys for your tables for optimum performance. For rowstore tables, select a primary key that uniquely identifies a record and secondary keys on the columns to be filtered upon. For columnstore tables, you specify a columnstore key, and a shard key for a more even distribution that prevents skew.

Since our data is stored on S3, we used MemSQL Loader, an open-source tool that loads sets of files from Amazon S3. MemSQL Loader also takes care of deduplicating files, parallelizing the workload, retrying files that fail to load, and more.

TOOLS TO INTERACT WITH THE CLUSTER

One of the benefits of MemSQL is that it is wire-compliant with MySQL, i.e. connecting is as easy as connecting to a MySQL database. Hence, you can use any MySQL client to connect to your MemSQL cluster. For this exercise, we used the command line from the master aggregator to create tables and execute queries.

KEY TAKEAWAYS/INSIGHTS

With MemSQL, we were able to achieve significant performance improvements. We are excited to share the results below.

INGESTION TIME

For data ingestion, you need to execute the data loading process from the aggregator nodes; thus, having more child aggregator nodes helps parallelize the loading process. We were able to load the entire data set (3 TB) within 2 hours. The rate of ingestion on each aggregator ranged between MB/sec.

QUERY PERFORMANCE

We present results for multiple queries, ranging from simple aggregations over a single table to complex queries with three-way joins.

Query 1: Standard computation over a single table

  select count(*) from table;

Query 2: 3-way join (2 columnstore and 1 rowstore)

To achieve this result, we initially stored all 3 logs in columnstore tables, but the performance was really poor. As per the MemSQL team, 3-way joins are not yet supported for columnstore tables and are still evolving. However, a recommended approach is to restrict yourself to 2 large-volume columnstore tables if they are to be used in a single query and keep the other tables in memory. Thus, we separated the less bulky event log into a rowstore table. As a result, we were able to query the data in quick time. Below is the query for the exercise:

  select c1, c2, c3
  from table1 t1
  join table2 t2 on t1.request_id = t2.request_id
  join table3 t3 on t2.request_id = t3.request_id
  group by c1, c2, c3;

Note that the query also used multiple MemSQL functions, such as crypto functions (MD5) and string functions (SUBSTR and CONCAT), in the select clause.

Query 3: 2-way join (1 rowstore and 1 columnstore)

PURGING DATA

The time required to purge data is a critical measure as well, because in production the pipeline must be reused for a recurring use case over a sliding window. To purge the data, we fired delete statements, with the following results:

  memsql> delete from metadata_log;
  Query OK, rows affected (8.39 sec)

  memsql> delete from request_log;
  Query OK, rows affected (10.78 sec)

  memsql> delete from event_log;
  Query OK, rows affected (13.43 sec)

However, we noticed that deleting data from the in-memory rowstore tables takes more time than from their columnstore counterparts.

TUNING

While querying the columnstore tables using a three-way join, we noticed it was consuming unreasonable time. This is when we came across the recommendation to keep fewer row segment groups in a columnstore table for better performance. After inspecting the state of our columnstore tables, we found a large number of row segment groups, probably because the optimistic merger could not catch up under the heavy workload. Thus, we triggered the pessimistic merger manually using the OPTIMIZE TABLE <tablename> command. We also connected with folks at MemSQL, who were really cooperative in resolving our queries; based on their recommendations, we made some changes to our queries to optimize the joins. Moreover, for rowstore tables it is recommended to use secondary keys in addition to the primary key to allow quicker filtering, i.e. you can add the columns frequently used in the where clause as secondary keys; multiple secondary keys can be specified as well.

BENCHMARK #2

After the first benchmark, we wanted to push the limits and test on larger data to establish our faith in MemSQL. As a result, we increased the overall data volume by 6x.

DATASET

  S.No.  File Type     Number of records  Compressed size (in TB)  Number of fields
  1      Request Log   29,064,598,
  2      Event Log     142,638,
  3      Metadata Log  19,664,562,

Based on the volume of the dataset, we spun up a new cluster for this benchmark with r3.4xlarge instances; there were a few experiments before settling on the final cluster to hold the data. After loading the data into MemSQL, we noticed that the columnstore engine compresses the data significantly: we saw a 50% compression ratio for our dataset, which is a huge benefit in terms of capacity planning for our clusters and, in turn, cost savings.

CLUSTER

  Option                         Value
  Number of Master Aggregators   1
  Number of Child Aggregators    4
  Number of Leaf nodes           18
  EBS Volume Size (in gigabytes) 1024 (EBS)
  EC2 Instance Type              r3.4xlarge

We spun up raw Amazon EC2 machines without installing any additional software except memsql-ops and memsql-loader (the latter only on aggregator nodes).

INGESTING DATA

To ingest this huge volume of data, we used MemSQL Loader again. The data ingest speed was very impressive (including the time to download the files from S3, extract them, and load them). But we faced some challenges on this front:

1. Lack of orchestration of the data loading process between master and child aggregators, i.e. each aggregator node is isolated from the others. Thereby, to maximize utilization of the aggregator nodes, we had to manually execute load statements on these machines separately. One possible solution to this problem seems to be using a load balancer to distribute the tasks across the aggregator nodes.

2. While loading huge data volumes, as mentioned in the tuning section above, the optimistic merger process falls behind, due to which several row segment groups get created for the columnstore tables. To extract maximum benefit from the sharding and enable faster queries, we had to trigger the pessimistic merger. The time taken to optimize one table was more than the loading time for the request table and metadata table combined. This indicates that loading data regularly is a better option to reduce the overall time to get data ready for querying.

QUERY PERFORMANCE

We ran multiple tests with different MemSQL functions. One of the noticeable results was with the count distinct query. This is a classic example of data going to the computation, due to which calculating an exact count of distinct values (i.e., the cardinality) of a big table can consume large amounts of memory and time.
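Concretely, the exact and approximate forms of such a cardinality query can be sketched as follows; the table and column names here are illustrative, not from our actual schema:

```sql
-- Exact cardinality: every distinct value must be tracked,
-- which is memory-heavy on a table with billions of rows
SELECT COUNT(DISTINCT device_id) FROM request_log;

-- Probabilistic cardinality via a HyperLogLog variant: bounded memory,
-- at the cost of a small relative error in the estimate
SELECT APPROX_COUNT_DISTINCT(device_id) FROM request_log;
```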
We ran this query on the request table and, as expected, it took a huge amount of time. To improve on this, MemSQL provides another aggregate function, APPROX_COUNT_DISTINCT, which gives a probabilistic count of the number of distinct values in the given column or expression, using a variant of the HyperLogLog algorithm. Below are the results of the execution on the metadata table:

  Function               Function Type  Time Unit  Response Time
  COUNT DISTINCT         Aggregate      Minutes    90
  APPROX COUNT DISTINCT  Aggregate      Minutes    25

We ran the same test on the request table, with the following results:

  Function               Function Type  Time Unit  Response Time
  COUNT DISTINCT         Aggregate      Minutes    FAILED (due to memory constraints)
  APPROX COUNT DISTINCT  Aggregate      Minutes    13

For the 3-way join query mentioned in Benchmark #1, the response time was 7 minutes. However, loading the output into an in-memory rowstore table took 140 minutes.

TABLEAU + MEMSQL

After querying the data from the command line, we thought it would be great to test it using a self-service BI tool like Tableau, which would enable the team to visualize patterns and generate quick reports as well. We connected Tableau to the MemSQL tables, and the rendering time for the charts was within seconds. This is in addition to the time taken to connect to the tables, i.e. the duration of Tableau loading the metadata:

  Columnstore Table: 2-6 minutes
  Aggregated In-memory Table: 1-2 minutes

Once the tables were connected, we were able to drag measures for standard aggregations, taking no more than seconds. As a next step, we sliced and diced the data by different dimensions and rendered various types of charts, with execution time staying under a minute; in fact, most of the charts rendered within 5-20 seconds.

Figure 3: Campaign Performance by Ads
Figure 4: Campaign Performance by Device OS (Geo)
Figure 5: Campaign Performance by Mobile Brand/Model

To speed up the rendering, we collected the aggregated output of the queries in rowstore tables in MemSQL and used them as the data source for our charts. As a result, the charts rendered almost instantly.
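The pre-aggregation step above can be sketched as follows, assuming a hypothetical chart of impressions per campaign and event type; the small rowstore table then serves as a fast data source for Tableau:

```sql
-- Hypothetical aggregate rowstore table used as a Tableau data source
CREATE TABLE campaign_perf_agg (
  campaign_id BIGINT NOT NULL,
  event_type VARCHAR(32) NOT NULL,
  impressions BIGINT,
  PRIMARY KEY (campaign_id, event_type)
);

-- Materialize the chart query once; Tableau reads the small result table
INSERT INTO campaign_perf_agg
SELECT campaign_id, event_type, COUNT(*)
FROM request_log
GROUP BY campaign_id, event_type;
```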

MemSQL has many more features than we could explore in the available bandwidth. The current setup is in the prototyping phase, but we are working closely to evaluate the use of MemSQL across a varied set of use cases for our production environments.

Figure 6: Campaign Performance by Event Type

The same reports and charts were also run for the larger data set loaded in Benchmark #2. The performance was better than in the first case with the in-memory aggregated table.

CONCLUSION

The initial results with MemSQL look very promising, making MemSQL a strong contender for cost-effective ad-hoc analysis over huge data volumes. Some of the noteworthy results include the high data ingestion rate and the query performance on single-table queries, both for columnstore and rowstore tables. We expected it to shine at reasonable data volumes, but sustaining the query times with the larger data set puts it way ahead for a use case like ours. MemSQL also posts an excellent compression ratio for repetitive column data. One thing to consider is the use of SSDs for data storage and the tradeoff between cost and performance: we did see a performance improvement with the smaller dataset, but it would be good to analyze this aspect in depth to aid capacity planning. Moreover, for data sets in the range of 100 GB to 500 GB, storing the data in memory is a viable scaling option for better throughput; with huge data sets this becomes an unreasonable bet.

ROADMAP

This benchmark was the first of several steps we plan to take to build the data pipeline. We started with this approach because it was relatively easy to measure and is a real use case we encounter with our current data processing stack. However, there are various other parameters to be examined in the upcoming weeks and months, including:

1. Workflow management: This is critical on the path to productionizing the use of MemSQL for this pipeline. The whole idea of allowing everyone to access the data for ad-hoc queries requires automation of tasks like cluster setup, data loading, and decommissioning of clusters.

2. Recovery: We want to quantify the capability and repercussions of a node failure.

3. Stress mode: How MemSQL performs when data starts to exceed capacity planning, or when parallel queries need more memory than is available on the cluster.

In addition, keeping in view other data engineering efforts being made by our team, we plan to scale this effort and wire MemSQL together with data processing frameworks to build an end-to-end real-time data pipeline. We also went through the case studies of Pinterest and Comcast, who use MemSQL heavily for their real-time needs. Comcast uses MemSQL as a sink for their Storm topology for real-time processing in parallel to their batch pipeline, while Pinterest is in the process of productionizing a real-time data pipeline using Kafka, Spark, and MemSQL. Their high-level architecture looks as follows:

Figure 7: Real-time data pipeline using MemSQL at Pinterest [2]

Since we have a robust processing layer built on Spark, employed for machine learning and other analytical problems, we plan to continue utilizing it and connect it to our MemSQL cluster for ad-hoc analytics. MemSQL recently introduced a connector to Spark, which will allow us to expand the analytics capabilities of MemSQL with the full range of Spark tools and libraries. This architecture (Figure 8) can help exploit MemSQL as a real-time database for business operations and for executing complex queries. At the same time, our data science team can manipulate and explore data using Spark and write the results back to MemSQL tables for consuming the output.

Figure 8: Spark with MemSQL

The benchmarking exercise was aimed at achieving certain results; many other considerations beyond query performance are still to be worked on. For example: the challenges of automating the end-to-end workflow as a regular job, and features like high availability, cross-data-center replication, and granular user permissions offered by MemSQL Enterprise Edition.

REFERENCES

1. MemSQL Docs
2. Real-time analytics at Pinterest (Official Blog): /real-time-analytics-at-pinterest
3. Real-time Stream Processing Architecture
4. Druid Website
5. Pivotal GemFire Overview (official docs): ing_started/topics/gemfire_overview.html
6. Apache Drill Docs
7. Extending MemSQL Analytics with Spark

AUTHORS

Satish Cattamanchi is a Team Lead at 4INFO and has been an integral part of the engineering team. He consistently works on designing and implementing comprehensive systems as part of the Ad Platform, using a plethora of technologies. He likes playing chess.

Sarvesh Gupta has extensive experience in designing and implementing analytics solutions, while being actively involved in cutting-edge technology projects. He has exposure to multiple domains as part of the Media and Entertainment practice, primarily advertising. When not coding, he enjoys reading, movies, and music.

ACKNOWLEDGMENTS

We would like to acknowledge the responsive MemSQL team for helping us resolve our queries and suggesting ways to improve.

DEFINITIONS, ACRONYMS, ABBREVIATIONS

BI: Business Intelligence
SQL: Structured Query Language
S3: Simple Storage Service


More information

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved. Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!

More information

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering June 2014 Page 1 Contents Introduction... 3 About Amazon Web Services (AWS)... 3 About Amazon Redshift... 3 QlikView on AWS...

More information

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau Powered by Vertica Solution Series in conjunction with: hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau The cost of healthcare in the US continues to escalate. Consumers, employers,

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Presented by: Dennis Liao Sales Engineer Zach Rea Sales Engineer January 27 th, 2015 Session 4 This Session

More information

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

ANALYTICS BUILT FOR INTERNET OF THINGS

ANALYTICS BUILT FOR INTERNET OF THINGS ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Analytics on Spark & Shark @Yahoo

Analytics on Spark & Shark @Yahoo Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013 Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment

More information

Scalability and Performance Report - Analyzer 2007

Scalability and Performance Report - Analyzer 2007 - Analyzer 2007 Executive Summary Strategy Companion s Analyzer 2007 is enterprise Business Intelligence (BI) software that is designed and engineered to scale to the requirements of large global deployments.

More information

Introducing Oracle Exalytics In-Memory Machine

Introducing Oracle Exalytics In-Memory Machine Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

How To Turn Big Data Into An Insight

How To Turn Big Data Into An Insight mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed

More information

Big data blue print for cloud architecture

Big data blue print for cloud architecture Big data blue print for cloud architecture -COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges

More information

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets

More information

Building Your Big Data Team

Building Your Big Data Team Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

Exploring the Synergistic Relationships Between BPC, BW and HANA

Exploring the Synergistic Relationships Between BPC, BW and HANA September 9 11, 2013 Anaheim, California Exploring the Synergistic Relationships Between, BW and HANA Sheldon Edelstein SAP Database and Solution Management Learning Points SAP Business Planning and Consolidation

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Cost-Effective Business Intelligence with Red Hat and Open Source

Cost-Effective Business Intelligence with Red Hat and Open Source Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,

More information

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager steve.gonzales@thinkbiganalytics.com

More information

Tableau Server 7.0 scalability

Tableau Server 7.0 scalability Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different

More information

AtScale Intelligence Platform

AtScale Intelligence Platform AtScale Intelligence Platform PUT THE POWER OF HADOOP IN THE HANDS OF BUSINESS USERS. Connect your BI tools directly to Hadoop without compromising scale, performance, or control. TURN HADOOP INTO A HIGH-PERFORMANCE

More information

BIG DATA ANALYTICS For REAL TIME SYSTEM

BIG DATA ANALYTICS For REAL TIME SYSTEM BIG DATA ANALYTICS For REAL TIME SYSTEM Where does big data come from? Big Data is often boiled down to three main varieties: Transactional data these include data from invoices, payment orders, storage

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com REPORT Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com The content of this evaluation guide, including the ideas and concepts contained within, are the property of Splice Machine,

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

Dashboard Engine for Hadoop

Dashboard Engine for Hadoop Matt McDevitt Sr. Project Manager Pavan Challa Sr. Data Engineer June 2015 Dashboard Engine for Hadoop Think Big Start Smart Scale Fast Agenda Think Big Overview Engagement Model Solution Offerings Dashboard

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Introduction For companies that want to quickly gain insights into or opportunities from big data - the dramatic volume growth in corporate

More information

G-Cloud Big Data Suite Powered by Pivotal. December 2014. G-Cloud. service definitions

G-Cloud Big Data Suite Powered by Pivotal. December 2014. G-Cloud. service definitions G-Cloud Big Data Suite Powered by Pivotal December 2014 G-Cloud service definitions TABLE OF CONTENTS Service Overview... 3 Business Need... 6 Our Approach... 7 Service Management... 7 Vendor Accreditations/Awards...

More information

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA WHITE PAPER April 2014 Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA Executive Summary...1 Background...2 File Systems Architecture...2 Network Architecture...3 IBM BigInsights...5

More information

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING

More information

Dell* In-Memory Appliance for Cloudera* Enterprise

Dell* In-Memory Appliance for Cloudera* Enterprise Built with Intel Dell* In-Memory Appliance for Cloudera* Enterprise Find out what faster big data analytics can do for your business The need for speed in all things related to big data is an enormous

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Ali Ghodsi Head of PM and Engineering Databricks

Ali Ghodsi Head of PM and Engineering Databricks Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed

More information

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014 5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

Microsoft Analytics Platform System. Solution Brief

Microsoft Analytics Platform System. Solution Brief Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal

More information

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence

More information

Why Big Data in the Cloud?

Why Big Data in the Cloud? Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data

More information

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics Paper 1828-2014 Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics John Cunningham, Teradata Corporation, Danville, CA ABSTRACT SAS High Performance Analytics (HPA) is a

More information

How To Make Data Streaming A Real Time Intelligence

How To Make Data Streaming A Real Time Intelligence REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log

More information

Next-Generation Cloud Analytics with Amazon Redshift

Next-Generation Cloud Analytics with Amazon Redshift Next-Generation Cloud Analytics with Amazon Redshift What s inside Introduction Why Amazon Redshift is Great for Analytics Cloud Data Warehousing Strategies for Relational Databases Analyzing Fast, Transactional

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Customer Case Study. Sharethrough

Customer Case Study. Sharethrough Customer Case Study Customer Case Study Benefits Faster prototyping of new applications Easier debugging of complex pipelines Improved overall engineering team productivity Summary offers a robust advertising

More information

PUSH INTELLIGENCE. Bridging the Last Mile to Business Intelligence & Big Data. 2013 Copyright Metric Insights, Inc.

PUSH INTELLIGENCE. Bridging the Last Mile to Business Intelligence & Big Data. 2013 Copyright Metric Insights, Inc. PUSH INTELLIGENCE Bridging the Last Mile to Business Intelligence & Big Data 2013 Copyright Metric Insights, Inc. INTRODUCTION... 3 CHALLENGES WITH BI... 4 The Dashboard Dilemma... 4 Architectural Limitations

More information

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling

More information

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture Apps and data source extensions with APIs Future white label, embed or integrate Power BI Deploy Intelligent

More information

Big Data must become a first class citizen in the enterprise

Big Data must become a first class citizen in the enterprise Big Data must become a first class citizen in the enterprise An Ovum white paper for Cloudera Publication Date: 14 January 2014 Author: Tony Baer SUMMARY Catalyst Ovum view Big Data analytics have caught

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Unleash your intuition

Unleash your intuition Introducing Qlik Sense Unleash your intuition Qlik Sense is a next-generation self-service data visualization application that empowers everyone to easily create a range of flexible, interactive visualizations

More information