PAXATA DATA PREPARATION PERFORMANCE BENCHMARKING SPRING 15 RELEASE


February 2015

Table of Contents
Introduction
Paxata Technology Stack
    The user interface layer
    Data preparation application web services
    Parallel in-memory pipelined data prep engine
    File management and storage
Production Deployment
    Architecture
Performance Metrics
    Criteria
    Results
    Usage
    Extreme Scalability
Summary
About Paxata

Introduction

For the last 30 years, traditional data integration products have been IT's workhorse for processing data. Data Integration (DI), also known as ETL, is the analysis, combination, and transformation of data from a variety of sources and formats into a unified data model representation. Data Integration is a key element of data warehousing, application integration, and business analytics solutions. The variety and volume of data are always increasing, and the performance of data integration systems is critical. However, there has been no industry standard for measuring and comparing the performance of DI systems. The TPC-DI benchmark subcommittee is continuing to refine the specification.

While today's self-service data preparation solutions should be able to handle the same data volumes, some of the basic performance tests used for legacy ETL tools are simply not relevant. A self-service data prep platform, built specifically for business users, has a new set of performance metrics based on the direct interaction between business analysts and the underlying system. Whether it is hosted in the cloud or deployed on premise, there is a demand for high performance and elasticity that was never expected of ETL tools, because business users never interacted with them directly.

This report covers the most recent tests performed on Paxata's Spring 15 release. Paxata tests all major releases against a set of benchmarks that were initially established in the Fall 2013 release. Details about system configuration can be found at the back of this document.

Paxata Technology Stack

The Paxata architecture consists of four layers: an HTML5 UI (User Interface), a Java web services layer, a parallel pipelined data prep engine that wraps Apache Spark with additional functionality built to optimize Spark performance and responsiveness, and a data management layer that persists data inside HDFS (Hadoop Distributed File System).
This architecture and code base are used for both our multi-tenant cloud service and our on premise deployment model. Cloud customers get the power of Paxata's robust architecture without the additional cost or burden of maintenance. They simply log on and start data prep projects. On premise or private cloud customers can deploy Paxata within a dedicated Hadoop environment or as part of their existing Hadoop cluster.

The user interface layer

The user interface layer is built on HTML5 and WebSocket technology, which ensures the system is multi-user aware. That means Paxata can be used from any web browser, on any device. It also means that when someone makes a change in the system, whether setting up a new project, adding data, publishing data, or working on data sets, all authorized Paxata users in their system can see those actions being performed in real time. This component delivers a visual user experience that is symmetric across all devices, such as desktop web browsers, tablets, and smart phones.

The layer also includes a web services toolkit (REST API) that allows for programmatic system access, as well as ODBC/JDBC connectivity that enables users to query Paxata AnswerSets via Impala or Hive.

Data preparation application web services

A lightweight Java layer translates and mediates actions from the user interface into commands to the underlying platform layer. This layer handles critical capabilities for rules around tenants, users, projects, and cell-level modifications, creating a comprehensive governance backbone. It also manages time-stamping and versioning for every operation performed, which is the secret sauce behind the Paxata Step Editor. A lightweight instance of MongoDB is dedicated to the Paxata instance and captures all of the application metadata from the web services. Some customers prefer to use their own instance of MongoDB, which is acceptable as long as the versions are compatible.

Parallel in-memory pipelined data prep engine

Intellifusion is Paxata's data prep automation engine, enabled by proprietary machine learning, latent semantic indexing, statistical pattern recognition, and text analytics techniques. Intellifusion handles data in a model-free environment and operates over a large variety and volume of structured and unstructured data in real time, enabled by a vector query processor.
At the core of Intellifusion is the combination of a distributed in-memory processing engine from Apache Spark with a Paxata-proprietary Spark interface that interprets requests from the web services layer and compiles them into the minimum set of operations that need to be executed on the cluster. This reduces the burden on Spark by delivering only the necessary jobs to the server. While Spark is used out of the box (no modifications are made to CDH), here are some of the areas where Paxata has invested significant development time to increase the efficiency and intelligence of Spark:

In addition to the Resilient Distributed Datasets (RDDs) that come with Spark, Paxata developed a number of proprietary abstractions that perform projections, filtering, grouping, and joins.

PaxRequests reduce the burden on Spark by organizing and optimizing sequences of RDD operations as part of a higher-level construct for viewing data and for creating clusters, aggregates, histograms, relationships, and more.

This layer also includes Paxata's intelligent cache management layer, which allows caches to be invoked in-line (on a given node) or remotely, so the system can call on data cached on other nodes and present it seamlessly to the user.

File management and storage

All data sets and AnswerSets are stored and accessed through the Paxata Library, which sits on top of HDFS (the Hadoop Distributed File System). For on premise or private cloud, there are two deployment options for data persistence: customers can either use an existing Hadoop cluster or create a specific Hadoop cluster for Paxata. Cloud customers get all the power of Hadoop without ever needing to think about the underlying file management and storage technologies.

The virtualized, highly reliable infrastructure for our multi-tenant cloud service runs on Amazon Web Services. On premise customers can also deploy Paxata's Adaptive Data Preparation platform on VMware vCloud environments.
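The idea described for the data prep engine, where requests are recorded and only the necessary work is sent for execution, can be illustrated with a small sketch. The Python below is not Paxata's or Spark's code: the `LazyPlan` class, its methods, and the column names are hypothetical, chosen only for illustration. It shows how filter and projection steps can be recorded without running anything and then executed together in a single pass over the rows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class LazyPlan:
    """Hypothetical request plan: records steps without running them."""
    steps: Tuple[Tuple[str, object], ...] = ()

    def where(self, predicate: Callable[[Row], bool]) -> "LazyPlan":
        # Record a filter step; no data is touched here.
        return LazyPlan(self.steps + (("filter", predicate),))

    def project(self, *columns: str) -> "LazyPlan":
        # Record a projection step; no data is touched here.
        return LazyPlan(self.steps + (("project", columns),))

    def run(self, rows: Iterable[Row]) -> Iterator[Row]:
        # Apply every recorded step to each row in one pass over the data.
        for row in rows:
            current = row
            keep = True
            for kind, arg in self.steps:
                if kind == "filter":
                    if not arg(current):
                        keep = False
                        break
                else:  # "project"
                    current = {name: current[name] for name in arg}
            if keep:
                yield current


# Illustrative rows only; the column names are assumptions.
rows = [
    {"id": 1, "year": 2013, "block": "A"},
    {"id": 2, "year": 2014, "block": "B"},
    {"id": 3, "year": 2015, "block": "C"},
]

plan = LazyPlan().where(lambda r: r["year"] >= 2014).project("id", "block")
result = list(plan.run(rows))  # work happens only when the plan is run
```

Building `plan` touches no rows; the data is read once, when `run` is iterated. A real engine would also reorder and merge steps, which this sketch does not attempt.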

Production Deployment

Architecture

Paxata's production deployment architecture in Amazon Web Services consists of the following components:

Web Services and Data Library: 1 X 32 core 60GB instance
In-Memory Pipelined Data Prep Engine on Apache Spark: Between X 8 core 60GB instances (the system elastically scales)
Hadoop Cluster: 8 X 4 core 30GB instances
MongoDB: 3 X 1 core 3.7GB
LDAP & DNS: 4 X 1 core 2GB

The production deployment architecture is depicted in the diagram below:

Performance Metrics

Criteria

Paxata's Data Preparation benchmark is inspired by TPC-DI, the Data Integration (also known as 'ETL') benchmark developed by the TPC. Paxata's benchmark combines and transforms data extracted from multiple On-Line Transaction Processing (OLTP) systems along with other sources of data, and persists it into an AnswerSet that can then be sent to a variety of destinations, including reporting and visualization tools, analytic applications, traditional data warehouses, or Hadoop clusters. The source and destination data models, data transformations, and implementation rules have been designed to be broadly representative of modern data integration requirements, characterized by:

Ingestion of large volumes of data
Multiple data sources, utilizing a variety of different data formats
A mixture of transformation types including data validation, key lookups, conditional logic, data type conversions, complex aggregation operations, etc.
AnswerSet building and maintenance operations

One extremely important difference between traditional DI benchmarking and data preparation (DP) benchmarking is that Paxata allows for interactive processing in addition to batch processing. Given that this is a breakthrough capability not available in legacy systems, the questions we focused on during testing were as follows:

1. Based on a changing data volume, how long did it take to load from HDFS into Spark?
2. Visualization of filtergrams: how quickly did the system return results of a text filtergram on numeric data? A text filtergram on string data? A numeric filtergram on numeric data?
3. Multiple filtergrams: how long did it take to select a value from a filtergram and re-render the grid when there were multiple filtergrams?
4. Full scan operations: how long does it take the system to sort, or to aggregate and group by, on a single column or multiple columns?
5. Join Detection with Intellifusion: how long did it take to do join detection across multiple datasets?
6. Join Execution: how long do inner and various types of outer joins take to execute?
7. Shaping operations: how quickly is the system able to transpose, pivot, or depivot datasets?
8. Hashing operations: how quickly is the system able to create buckets based on hashing to support operations such as clustering?
9. Publishing: how quickly can the system push all of the rows of an underlying dataset through the pipeline?
10. For all of the above, what is the difference in execution time between cached and uncached operations?
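Question 10, the difference between cached and uncached execution time, can be measured with a simple timing harness. The sketch below is a generic illustration and not the harness Paxata used: `full_scan_sort` is a hypothetical stand-in for a full-scan operation, and Python's `functools.lru_cache` stands in for a result cache.

```python
import time
from functools import lru_cache


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def full_scan_sort(n):
    # Hypothetical stand-in for a full-scan operation: sort n values.
    return tuple(sorted(range(n, 0, -1)))


# Wrap the operation in a cache so a repeated request is served from memory.
cached_sort = lru_cache(maxsize=None)(full_scan_sort)

first, t_uncached = timed(cached_sort, 200_000)   # cache miss: does the sort
second, t_cached = timed(cached_sort, 200_000)    # cache hit: no sort needed

print(f"uncached: {t_uncached:.4f}s, cached: {t_cached:.6f}s")
```

Timing each scenario once uncached and once cached, and reporting both, separates the cost of the operation itself from the benefit of the cache.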

Results

The results below were obtained on a cluster with 27 Spark workers (in the deployment model described in the previous section) using three publicly available datasets intended to represent a prototypical business analyst data preparation use case:

Dataset 1: 20 million rows x 22 columns
Dataset 2: 100 k rows x 22 columns
Dataset 3: 2 million rows x 198 columns

Performance Comparisons of Paxata Fall 14 and Spring 15 Release

Scenario | Fall 14 - Not Cached | Spring 15 - Not Cached | % Change - Not Cached
Load Dataset 1 in project (20 million rows x 22 columns) | | |
Bring up filter for col Primary Type | | |
Select entry Narcotics in the filter | | |
Bring up filter on col Year | | |
Change range to be | | |
Close filter on Year | | |
Close filter on Primary Type | | |
Sort col Block | | |
Sort col ID | | |
Bring up cluster + edit on col Block ( clusters) | | |
Cluster automatically | | |
Bring up filter on col Block | | |
Group By on col Primary Type (32 rows) with metric Count of ID | | |
Sort col Count - ID | | |
Transpose with Row Values = Arrest and Column Labels = Primary Type | | |
De-duplicate on Primary Type | | |
De-duplicate on Year | | |
Pivot with Row Labels = Primary Type and Column Labels = Arrest and metric Count of ID | | |
Add lookup Dataset 2 (100 k rows x 22 columns)
Join Detection | | |
Left Outer | | |
Inner | | |
Right Outer | | |
Full Outer | | |
Add lookup Dataset 3 (2 million rows x 198 columns)
Join Detection | | |
Left Outer | | |
Inner | | |
Right Outer | | |
Full Outer | | |
Total Median | | |

As can be seen in the above performance benchmark, Paxata's aggregate median execution time across all operations has been reduced by over 80% in the span of two releases on uncached data. With caching enabled, once an initial operation has completed, subsequent operations of the same type return with sub-second response times.

The results above are based on modest data sizes intended to show the performance improvement across releases against a stable benchmark. However, Paxata has been proven to scale to much larger volumes while retaining interactive performance. Tests similar to the above have been run on a single one billion row dataset on a 128 node cluster in Amazon. Each r3.2xlarge virtual machine had 8 CPUs, 60GB of memory, and a 140GB ephemeral disk (SSD speeds). The system demonstrated random access to any window of the one billion row dataset in under 10 seconds, showing the power of Paxata's adaptive windowing architecture, which executes transformations lazily on subsets of the data until the data is published.

Usage

To relate these results to individual customers' usage, the table below provides some key statistics for some of Paxata's customers in our multi-tenant cloud:

Tenant | Projects | Library Artifacts | Max Row Count | Median Row Count
High Tech manufacturer | | | |
Analytics consultancy | | | |
Consumer Packaged Goods company | | | |
Healthcare organization | | | |
Financial Services Organization | | | |

As shown above, the largest number of datasets for a given tenant is 655, while the largest number of data preparation projects is 487. Most impressively, the high tech manufacturer is preparing data of 20,000,000 rows with interactive performance. It should be noted that usage of Paxata in on premise deployments significantly exceeds the multi-tenant cloud in terms of data volumes.
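The adaptive windowing behaviour mentioned in the Results section, where transformations run lazily on a subset of the data until it is published, can be sketched as follows. This is a simplified illustration under stated assumptions, not Paxata's implementation: the `LazyDataset` class and its methods are hypothetical, the source is assumed to support cheap slicing, and only row-wise transformations are modelled.

```python
class LazyDataset:
    """Hypothetical dataset that applies row-wise transforms only on demand."""

    def __init__(self, source, transforms=()):
        self.source = source            # any sliceable sequence, e.g. a range
        self.transforms = tuple(transforms)
        self.rows_evaluated = 0         # counts rows actually transformed

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return LazyDataset(self.source, self.transforms + (fn,))

    def window(self, start, size):
        # Transform only the rows inside the requested window.
        out = []
        for value in self.source[start:start + size]:
            for fn in self.transforms:
                value = fn(value)
            out.append(value)
            self.rows_evaluated += 1
        return out

    def publish(self):
        # Publishing pushes every row through the recorded transforms.
        return self.window(0, len(self.source))


# A one billion row source; a Python range is sliced without materializing it.
dataset = LazyDataset(range(1_000_000_000)).map(lambda x: x * 2)
view = dataset.window(500_000_000, 3)   # touches only three rows
```

Viewing a window in the middle of the dataset evaluates three rows, not one billion; the full cost is paid only when `publish` is called.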
Extreme Scalability

The scalability of the Paxata system is directly tied to that of the Apache Spark system upon which it is built. Recently, the version of Spark used by Paxata was submitted to an industry benchmark that measures how fast a system can sort 100 TB of data (one trillion records). Using 206 EC2 machines, Spark sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. Additionally, Spark was able to sort one PB of data (ten trillion records) on 190 machines in under four hours, also shattering previous records. All the sorting took place on disk (HDFS), without using Spark's in-memory cache. The Spark cluster was able to sustain 3 GB/s/node of I/O activity during the map phase and 1.1 GB/s/node of network activity during the reduce phase, saturating the 10Gbps link available on these machines.

 | Hadoop MR Record | Spark Record | Spark 1 PB
Data Size | | 100 TB | 1000 TB
Elapsed Time | 72 minutes | 23 minutes | 234 minutes
# Nodes | 2100 | 206 | 190
# Cores | physical | 6592 virtualized | 6080 virtualized
Cluster disk throughput (est.) | 3150 GB/s | 618 GB/s | 570 GB/s
Sort Benchmark Daytona Rules | Yes | Yes | No
Network | Dedicated data center, 10Gbps | Virtualized (EC2) 10Gbps network | Virtualized (EC2) 10Gbps network
Sort rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min
Sort rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min

This benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules generates 500 TB of disk I/O and 200 TB of network I/O. Because Paxata is built on Apache Spark and adds a significant number of its own performance optimizations, as discussed above, its performance is state of the art in comparison to previous generations of data preparation systems.

Summary

Paxata is a state-of-the-art data preparation product with a highly innovative architecture that is extremely performant and scalable, and it is deployed in production for more than three dozen customers today. It is the only system in the industry that can provide interactive data preparation against massive volumes, and its performance will continue to improve through a combination of Moore's law and planned improvements in our technology.
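The sort benchmark figures quoted in the Extreme Scalability section can be cross-checked with a few lines of arithmetic. The Python below uses only numbers stated in that section (node counts, elapsed times, and sort rates); the dictionary layout and function name are illustrative, and 1 TB is assumed to equal 1000 GB.

```python
# Figures as quoted in the Extreme Scalability section.
hadoop_mr = {"nodes": 2100, "minutes": 72, "sort_rate_tb_per_min": 1.42}
spark_100tb = {"nodes": 206, "minutes": 23, "sort_rate_tb_per_min": 4.27}
spark_1pb = {"nodes": 190, "minutes": 234, "sort_rate_tb_per_min": 4.27}


def per_node_gb_per_min(run):
    """Per-node sort rate in GB/min, assuming 1 TB = 1000 GB."""
    return run["sort_rate_tb_per_min"] * 1000 / run["nodes"]


# "3X faster using 10X fewer machines" for the 100 TB sort.
speedup = hadoop_mr["minutes"] / spark_100tb["minutes"]      # roughly 3.1
machine_ratio = hadoop_mr["nodes"] / spark_100tb["nodes"]    # roughly 10.2

# Sort rate for the 1 PB run: 1000 TB in 234 minutes.
pb_rate_tb_per_min = 1000 / spark_1pb["minutes"]             # roughly 4.27

for name, run in [("Hadoop MR", hadoop_mr),
                  ("Spark 100 TB", spark_100tb),
                  ("Spark 1 PB", spark_1pb)]:
    print(f"{name}: {per_node_gb_per_min(run):.2f} GB/min per node")
```

The per-node rates computed this way agree with the Sort rate/node row of the table to within rounding, and the speedup and machine ratio match the "3X" and "10X" claims.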
About Paxata

Paxata delivers the first purpose-built Adaptive Data Preparation solution for business analysts, data scientists, developers, data curators, and IT teams, enabling the integration, cleansing, and enrichment of raw data into rich, analytic-ready data to power ad hoc, operational, predictive, and packaged analytics. Paxata partners with industry-leading big data and business intelligence solution providers such as Cloudera, and connects seamlessly to BI tools, including Salesforce.com, Tableau, Qlik, and Microsoft Excel, to greatly accelerate the time to actionable business insights. To learn more, visit the Paxata website.