Trusted Analytics Platform (TAP) TAP Technical Brief October 2015
Overview

Trusted Analytics Platform (TAP) is open source software, optimized for performance and security, that accelerates the creation of Cloud-native applications driven by Big Data Analytics. The project brings together many proven and familiar open source components, integrating them in one vibrant communities behind these open source projects.

Purpose

In many cases, enterprises struggle to deliver tangible value from Big Data Analytics. The time required to adapt essential technologies and extend them into a custom solution is simply too long. These challenges span data, analytics and application development. TAP has been designed, from the ground up, to reduce these complexities. It provides a shared environment for advanced analytics on Big Data, making it easier for developers to collaborate with data scientists on both public and private Cloud infrastructures with hardware-enhanced performance and security optimizations.

TAP was designed with three types of users in mind: data scientists, application developers and system operators.

TAP supports data scientists by eliminating many of the current limitations associated with analytics at scale and by helping them with the most common tasks:

Data Sampling: Selecting a representative feature/data subset
Data Preparation: Transform and fuse data sets
Model Select/Train: Manually select from multiple algorithms
Model Validation: Simulate at scale, resolve issues
Model Deployment: Deploy into scoring engine, track drift
Horizontal Solutions: Service discovery and binding

Using TAP, data scientists are able to build and train models for Big Data using familiar interfaces like IPython notebooks, RStudio, H2O Flow or the Eclipse IDE. TAP provides libraries for graph, deep learning and classical machine learning algorithms.
Each algorithm is open source, and almost all are parallelized and can be executed on a distributed processing system such as Apache Spark or Apache Hadoop, in some cases with CPU and GPU hardware acceleration.
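To make the idea of a parallelized algorithm concrete, the following is a minimal sketch of the map-and-reduce pattern that distributed frameworks such as Apache Spark apply at cluster scale. It is not TAP code: the function names are hypothetical, and a local thread pool stands in for cluster workers purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, rem = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def summarize(chunk):
    """Map step: count, sum and sum of squares for one partition."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(data, n_parts=4):
    """Mean and population variance of a non-empty list of numbers.

    Threads stand in for cluster workers here (an assumption made for
    illustration); a real deployment would ship each partition to a
    separate node.
    """
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(summarize, partition(data, n_parts)))
    n, total, total_sq = reduce(combine, partials)
    mean = total / n
    return mean, total_sq / n - mean * mean
```

Because each partition is summarized independently and the partial results merge associatively, the same logic carries over to any number of workers, which is the property that lets such algorithms scale out.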
Connectors: Message brokers and queues (Kafka, RabbitMQ, MQTT, WS, REST...)
Processors: Distributed stream and batch data (Hadoop, Spark, Gearpump)
Stores: Optimized data stores (HDFS, HBase, PostgreSQL, MySQL, Cassandra, etc.)
Analytics: Model deployment, scoring, search, charts, dashboards (Spark, Impala, H2O...)

Key components in TAP for working with data

Application developers gain immediate access to a polyglot application platform based on open source Cloud runtimes. Combined with dynamically bindable services and expressive APIs, the platform allows application developers to greatly reduce their development time and simplifies integration with all the data analytics capabilities developed by data scientists. Many common ingestion protocols, like MQTT, WS and REST, as well as message queues based on Apache Kafka and RabbitMQ, are implemented in TAP and allow application developers to easily connect to data processing services, such as Apache Spark. TAP also includes a native framework based on Akka, called Gearpump, which delivers distributed and low-latency data processing capabilities.

A screenshot of TAP's console marketplace shows some of the available services and data stores

For system operators, TAP eliminates many of the complexities normally associated with a secure and scalable Big Data platform. Using the simple administration interface and an open framework for services, system
Platform Architecture

TAP is a multi-tenant platform designed to ease and accelerate the delivery of end-to-end analytical applications. Its loosely-coupled, layered architecture solution delivery.

The Data Layer combines the essential components of any Big Data management system: Distributed File Systems and Distributed Processing Frameworks. Additionally, the Data Layer has a range of data stores each optimized for a particular structure: Key-Value, Relational Database Management Systems, Document-Oriented, Columnar, Graph, Time Series and others. While the number of data stores supported in TAP continues to grow, the exact composition

Figure: TAP platform layers. Application Layer (Custom Apps, Built-in Apps); Analytics Layer (Engines/Frameworks; Algorithms, Models & Pipelines); Data Layer (Distributed Processing, Scalable Data Persistence); Provisioning, Management, Monitoring; APIs.

The Analytics Layer includes analytics tools and the API server that translates function calls from the analytics tools to the supporting algorithms for data wrangling and machine learning. Its plugin architecture allows system operators to expand TAP capabilities and automatically expose user-accessible APIs for newly added functions. In this layer, TAP also leverages many of the technologies from its partners, including H2O and

Based on Cloud Foundry, the Application Layer exposes various runtimes, messaging frameworks or connectors for dynamic service binding, service brokers for ease of access and a service catalog to enable service discovery. Additionally, the Application Layer supports an extension construct called buildpacks. When developers publish their applications, Cloud Foundry automatically detects which buildpack is required and installs it on the Droplet Execution Agent (DEA) where the application needs to run.

At every layer of the platform architecture, performance optimizations have been incorporated to maximize the speed of analytic operations.
In parallel, security enhancements, from silicon up, ensure data and operations are protected.
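The buildpack detection described above can be pictured with a short sketch. This is an illustration only, not Cloud Foundry's implementation: in Cloud Foundry each buildpack supplies its own detection logic, whereas the marker table and function name below are simplified, hypothetical stand-ins.

```python
# Hypothetical, simplified marker files. Real buildpacks run their own
# detection logic, which is richer than a filename lookup.
BUILDPACK_MARKERS = [
    ("python", ("requirements.txt", "setup.py")),
    ("nodejs", ("package.json",)),
    ("ruby", ("Gemfile",)),
]


def detect_buildpack(app_files):
    """Return the first buildpack whose marker file is present, else None.

    Buildpacks are checked in list order, so an application matching
    several buildpacks gets the earliest one in the table.
    """
    names = set(app_files)
    for buildpack, markers in BUILDPACK_MARKERS:
        if any(marker in names for marker in markers):
            return buildpack
    return None
```

For example, under these assumptions an application directory containing app.py and requirements.txt would be matched to the Python entry, while a directory with no known marker file would be rejected.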
Performance Optimizations

Data Analytics Acceleration Library (DAAL). extends DAAL to persist the analytical models in its metastore to deliver a consistent end-to-end pipeline for data scientists with broad support for a number of Machine Learning algorithms. TAP can also integrate Intel's Math Kernel Library (MKL). MKL accelerates math processing routines that increase application performance and reduce development time. It includes highly-vectorized and threaded Linear Algebra, Fast Fourier Transforms (FFT), Vector Math and Statistics functions. These functions automatically scale on Intel processor architectures by selecting the best code path for each processor generation.

Security

TAP inherits and extends the OAuth 2.0 authentication mechanisms enabled in Cloud Foundry (CF) by integrating them with the Kerberos authentication mechanisms in Hadoop. TAP integrates these two authentication frameworks using Apache Kerby, which allows a CF application to authenticate its users via OAuth 2.0 and convert the OAuth token into a service ticket that is accepted by Kerberos. As a result, the user gets a single sign-on experience across the Data and Application Layers, and each of their actions within TAP is performed within the context of that user. This deep context awareness allows TAP to audit each activity, whether performed in analytical tools or within custom applications.

Deployment Model

TAP supports two deployment models. Organizations that have already deployed Apache Hadoop in their data center (virtualized or on bare metal) can reuse the existing cluster or choose to have TAP deploy Hadoop.

Figure: TAP deployment models. Labels: SRV, APP, DATA, OTHER; TAP, CloudFoundry, Hadoop; IaaS; Physical Hardware.
From a networking standpoint, TAP deploys into two subnets. The components within a subnet are isolated and all communication is routed through trusted APIs.

Figure: TAP network layout with a Data Subnet and an App Subnet. Labels: Control, Access, MQ, State; Cluster Manager Nodes, Service Nodes, Cluster Worker Nodes; System Operator (Command Line or Browser), Jump Box (SSH); IoT Devices (Gateway or Direct); Application User (Web, Mobile or Desktop); Data Scientist & App Builder (ipython, RStudio, H2O); Proxy (MQTT), Proxy (HTTP), API (HTTP); App Controller, Health Manager, User Services, Run-time.

TAP Deployment Recipes the common ways the platform can be extended. These recipes are reproducible, in many cases with a minimal amount of development. Each recipe comprises a number of smaller microservices whose behavior can be easily customized by the developer. The following diagram shows one such recipe, for hospital patient re-admission.

Figure: Patient re-admission recipe. Data Sources: TCP/IP-based Protocols (REST, WS, XMPP, MQTT/AMQP/DDS). Unique Secure Endpoint: HAProxy. Gateway: Go App. Message Queue: Kafka. Intel Analytics Toolkit for Hadoop*. Data Store: HDFS. Model API: Re-admission. Visualization & Dashboard App.
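The shape of such a recipe, a chain of small services connected by a message queue, can be sketched in a few lines. Everything here is an assumption made for illustration: an in-process queue stands in for Kafka, the record fields (patient_id, age, prior_admissions) are invented, and the scoring rule is a toy placeholder rather than a trained re-admission model.

```python
import json
import queue


def gateway(raw_messages, mq):
    """Ingestion stage: parse JSON messages and publish valid records."""
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed input
        if isinstance(record, dict):
            mq.put(record)


def score(record):
    """Hypothetical stand-in for the model API: a toy rule, not a real model."""
    risk = 0.1 * record.get("prior_admissions", 0)
    if record.get("age", 0) >= 65:
        risk += 0.2
    return min(risk, 1.0)


def scoring_service(mq):
    """Processing stage: drain the queue and attach a risk score to each record."""
    results = []
    while not mq.empty():
        record = mq.get()
        results.append({"patient_id": record.get("patient_id"), "risk": score(record)})
    return results


def dashboard(results, threshold=0.5):
    """Presentation stage: list patients at or above the risk threshold."""
    return [r["patient_id"] for r in results if r["risk"] >= threshold]


def run_recipe(raw_messages):
    """Wire the stages together; queue.Queue stands in for the message broker."""
    mq = queue.Queue()
    gateway(raw_messages, mq)
    return dashboard(scoring_service(mq))
```

Keeping each stage as a separate function mirrors the microservice structure of a recipe: any one stage, such as the scoring rule, could be replaced without touching the others.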
TAP Adoption and Contributions

TAP has been released as an open source project, including many popular and reliable components supported by a large community. The project, its code and documentation, including a Getting Started Guide with beginning steps for each user, can be found at http://trustedanalytics.github.io. To learn how a growing number of organizations support, use and are contributing to the TAP project to accelerate their Big Data Analytics application development across a broad range of use cases, visit http://.