Trusted Analytics Platform (TAP)
TAP Technical Brief
October 2015

Overview

Trusted Analytics Platform (TAP) is open source software, optimized for performance and security, that accelerates the creation of Cloud-native applications driven by Big Data Analytics. The project brings together many proven and familiar open-source components, integrating them in one [...] vibrant communities behind these open source projects.

Purpose

In many cases, enterprises struggle to deliver tangible value from Big Data Analytics. The time required to adapt essential technologies and extend them into a custom solution is simply too long. These challenges span data, analytics and application development. TAP has been designed, from the ground up, to reduce these complexities. It provides a shared environment for advanced analytics on Big Data, making it easier for developers to collaborate with data scientists on both public and private Cloud infrastructures with hardware-enhanced performance and security optimizations.

TAP was designed with three types of users in mind: data scientists, application developers and system operators.

For data scientists, TAP eliminates many of the current limitations associated with analytics at scale and helps with the most common tasks:

Data Sampling: Selecting representative feature/data subset
Data Preparation: Transform and fuse data sets
Model Select/Train: Manually select from multiple algorithms
Model Validation: Simulate at scale, resolve issues
Model Deployment: Deploy into scoring engine, track drift
Horizontal Solutions: Service discovery and binding

Using TAP, data scientists are able to build and train models for Big Data using familiar interfaces like IPython notebooks, RStudio, H2O Flow or Eclipse IDE. TAP provides libraries for graph, deep learning and classical machine learning algorithms. Each algorithm is open source. Almost all algorithms are parallelized and can be executed on a distributed processing system, such as Apache Spark or Apache Hadoop, in some cases with CPU and GPU hardware acceleration.
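The first four tasks in the list above (sampling, preparation, model selection/training and validation) can be illustrated with a minimal single-machine sketch in plain Python. It does not use TAP's libraries. Every function name here is hypothetical, the data is synthetic, and systematic sampling and the two toy models are stand-ins chosen only to keep the example short.

```python
from statistics import mean

def systematic_sample(rows, step):
    """Data sampling: keep every `step`-th row (systematic sampling)."""
    return rows[::step]

def prepare(rows):
    """Data preparation: rescale the single feature to the range [0, 1]."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    return [((x - lo) / span, y) for x, y in rows]

def train_majority(rows):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def train_threshold(rows):
    """Threshold model: split halfway between the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x >= cut else 0

def accuracy(model, rows):
    """Model validation: fraction of held-out rows predicted correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def select_model(train, holdout):
    """Model selection: fit every candidate, keep the best on the holdout set."""
    candidates = {"majority": train_majority, "threshold": train_threshold}
    scored = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
    return max(scored, key=scored.get), scored

# Synthetic data (an assumption for illustration): label is 1 when x >= 50.
rows = [(x, 1 if x >= 50 else 0) for x in range(100)]
prepared = prepare(systematic_sample(rows, step=2))
train, holdout = prepared[::2], prepared[1::2]
best, scores = select_model(train, holdout)
print(best, scores)
```

On a real platform each step would run on a distributed engine rather than in one process; the sketch only shows how the steps hand data to one another.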

Figure: Key components in TAP for working with data
Connectors: Message brokers and queues (Kafka, RabbitMQ, MQTT, WS, REST...)
Processors: Distributed stream and batch data (Hadoop, Spark, Gearpump)
Stores: Optimized data stores (HDFS, HBase, PostgreSQL, MySQL, Cassandra, etc.)
Analytics: Model deployment, scoring, search, charts, dashboards (Spark, Impala, H2O...)

Application developers gain immediate access to a polyglot application platform based on open source Cloud runtimes. Combined with dynamically bindable services and expressive APIs, it allows application developers to greatly reduce their development time and simplify integration with the analytical capabilities developed by data scientists. Many common ingestion protocols, like MQTT, WS and REST, as well as message queues based on Apache Kafka and RabbitMQ, are implemented in TAP and allow application developers to easily connect to data processing services, such as Apache Spark. TAP also includes a native framework called Gearpump, based on Akka, which delivers distributed and low-latency data processing capabilities.

Figure: A screenshot of TAP's console marketplace shows some of the available services and data stores.

For system operators, TAP eliminates many of the complexities normally associated with a secure and scalable Big Data platform. Using the simple administration interface and an open framework for services, system [...]
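The way a message queue decouples ingestion from data processing, as described for the connectors above, can be sketched with Python's standard library. This is an in-memory stand-in, not Kafka, RabbitMQ or Spark; `run_pipeline`, `to_fahrenheit` and the sample readings are hypothetical names and data chosen for illustration.

```python
import queue
import threading

def run_pipeline(messages, process, workers=2):
    """Publish messages to an in-memory queue consumed by worker threads."""
    inbox = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = inbox.get()
            if item is None:              # sentinel: this worker can stop
                inbox.task_done()
                break
            out = process(item)
            with lock:                    # protect the shared result list
                results.append(out)
            inbox.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for m in messages:                    # producer side: ingest readings
        inbox.put(m)
    for _ in threads:                     # one sentinel per worker
        inbox.put(None)
    inbox.join()                          # wait until everything is processed
    for t in threads:
        t.join()
    return results

def to_fahrenheit(reading):
    """Example processing step: add a converted temperature field."""
    return {**reading, "temp_f": reading["temp_c"] * 9 / 5 + 32}

readings = [
    {"device": "a", "temp_c": 0.0},
    {"device": "b", "temp_c": 20.0},
    {"device": "c", "temp_c": 100.0},
]
processed = run_pipeline(readings, to_fahrenheit)
```

Because workers pull from the queue independently, results may arrive in any order; a consumer that needs ordering has to sort or key the output itself.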

Platform Architecture

TAP is a multi-tenant platform designed to ease and accelerate the delivery of end-to-end analytical applications. Its loosely-coupled, layered architecture [...] solution delivery.

Figure: TAP layered architecture. Labels: Provisioning, Management, Monitoring; Application Layer (Custom Apps, Built-in Apps); Analytics Layer (Engines/Frameworks; Algorithms, Models & Pipelines); Data Layer (Distributed Processing, Scalable Data Persistence); APIs.

The Data Layer combines the essential components of any Big Data management system: Distributed File Systems and Distributed Processing Frameworks. Additionally, the Data Layer has a range of data stores, each optimized for a particular structure: Key-Value, Relational Database Management Systems, Document-Oriented, Columnar, Graph, Time Series and others. While the number of data stores supported in TAP continues to grow, the exact composition [...]

The Analytics Layer includes analytics tools and the API server that translates function calls from the analytics tools to the supporting algorithms for data wrangling and machine learning. Its plugin architecture allows system operators to expand TAP capabilities and automatically expose user-accessible APIs for newly added functions. In this layer, TAP also leverages many of the technologies from its partners, including H2O and [...]

Based on Cloud Foundry, the Application Layer exposes various runtimes, messaging frameworks or connectors for dynamic service binding, service brokers for ease of access and a service catalog to enable service discovery. Additionally, the Application Layer supports an extension construct called buildpacks. When developers publish their applications, Cloud Foundry automatically detects which buildpack is required and installs it on the Droplet Execution Agent (DEA) where the application needs to run.

At every layer of the platform architecture, performance optimizations have been incorporated to maximize the speed of analytic operations. In parallel, security enhancements, from silicon up, ensure data and operations are protected.
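The plugin idea described for the Analytics Layer, where newly added functions become callable through the API server without changes to the server itself, can be sketched as a small registry. This is not TAP's API server; the `ApiServer` class, the `stats/...` names and both example plugins are hypothetical.

```python
from statistics import mean, median

class ApiServer:
    """Toy API server: maps function names to registered algorithm plugins."""

    def __init__(self):
        self._plugins = {}

    def register(self, name):
        """Decorator that adds a function to the catalog under `name`."""
        def decorator(func):
            if name in self._plugins:
                raise ValueError(f"plugin already registered: {name}")
            self._plugins[name] = func
            return func
        return decorator

    def catalog(self):
        """Names currently exposed to analytics tools, in sorted order."""
        return sorted(self._plugins)

    def call(self, name, *args, **kwargs):
        """Translate a named call from a tool into the plugin invocation."""
        try:
            func = self._plugins[name]
        except KeyError:
            raise LookupError(f"unknown function: {name}") from None
        return func(*args, **kwargs)

server = ApiServer()

@server.register("stats/mean")
def column_mean(values):
    return mean(values)

@server.register("stats/median")
def column_median(values):
    return median(values)
```

Registering a function is the only step needed to expose it: `server.catalog()` lists it and `server.call("stats/mean", [1, 2, 3])` dispatches to it.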

Performance Optimizations

[...] Data Analytics Acceleration Library (DAAL). [...] extends DAAL to persist the analytical models in its metastore to deliver a consistent end-to-end pipeline for data scientists with broad support for a number of Machine Learning algorithms. TAP can also integrate Intel's Math Kernel Library (MKL). MKL accelerates math processing routines that increase application performance and reduce development time. It includes highly-vectorized and threaded Linear Algebra, Fast Fourier Transforms (FFT), Vector Math and Statistics functions. These functions automatically scale on Intel processor architectures by selecting the best code path for each processor generation.

Security

TAP inherits and extends the OAuth 2.0 authentication mechanisms enabled in Cloud Foundry (CF) by integrating them with the Kerberos authentication mechanisms in Hadoop. TAP integrates these two authentication frameworks using Apache Kerby, which allows a CF application to authenticate its users via OAuth 2.0 and convert the OAuth token into a service ticket that is accepted by Kerberos. As a result, the user gets a single sign-on experience across the Data and Application Layers, and each of their actions within TAP is performed within the context of that user. This deep context awareness allows TAP to audit each activity, whether performed in analytical tools or within a custom application.

Deployment Model

TAP supports two deployment models. Organizations that have already deployed Apache Hadoop in their data center (virtualized or on bare metal) can reuse the existing cluster or choose TAP to deploy it.

Figure: The two TAP deployment models. Labels: SRV, APP, DATA, OTHER; TAP, Cloud Foundry, Hadoop, IaaS, Physical Hardware.
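The token-to-ticket exchange described in the Security section can be illustrated with a toy simulation. It uses neither Apache Kerby nor real OAuth 2.0 or Kerberos messages: the HMAC-signed token, the dictionary "ticket", the `SECRET` key and both function names are assumptions made purely to show the shape of the flow (validate the bearer token, then mint a short-lived ticket that carries the same user identity).

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-signing-key"   # hypothetical shared key, for illustration only

def _sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def issue_token(user: str, expires_at: int) -> str:
    """Stand-in for the OAuth server: a signed bearer token for `user`."""
    payload = json.dumps({"sub": user, "exp": expires_at}, sort_keys=True).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)

def exchange_for_ticket(token: str, service: str, now: int, lifetime: int = 300) -> dict:
    """Validate the bearer token, then mint a short-lived ticket for `service`."""
    encoded, _, signature = token.partition(".")
    payload = base64.urlsafe_b64decode(encoded.encode())
    if not hmac.compare_digest(signature, _sign(payload)):
        raise PermissionError("token signature mismatch")
    claims = json.loads(payload)
    if claims["exp"] <= now:
        raise PermissionError("token expired")
    # The ticket keeps the user's identity so downstream actions can be
    # attributed to that user in an audit trail.
    return {
        "principal": claims["sub"],
        "service": service,
        "expires_at": min(claims["exp"], now + lifetime),
    }

token = issue_token("alice", expires_at=1000)
ticket = exchange_for_ticket(token, "hdfs", now=900)
```

The ticket never outlives the token it was derived from, which is one simple way to keep the two credential lifetimes consistent in a sketch like this.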

From a networking standpoint, TAP deploys into two subnets. The components within a subnet are isolated and all communication is routed through trusted APIs.

Figure: TAP deployment across the Data Subnet and the App Subnet. Labels: Control, Access, MQ, Cluster Manager Nodes, Service Nodes, State, Cluster Worker Nodes; Jump Box (SSH); Proxy (MQTT), Proxy (HTTP), API (HTTP), App Controller, Health Manager, User Services, Run-time. Actors: System Operator (command line or browser), IoT Devices (gateway or direct), Application User (web, mobile or desktop), Data Scientist & App Builder (IPython, RStudio, H2O).

TAP Deployment Recipes

[...] the common ways the platform can be extended. These recipes are reproducible, in many cases with a minimal amount of development. Each recipe comprises a number of smaller microservices whose behaviors can be easily customized by the developer. The following diagram demonstrates one such recipe, for hospital patient re-admission.

Figure: Hospital patient re-admission recipe. Components: Data Sources (TCP/IP-based protocols: REST, WS, XMPP, MQTT/AMQP/DDS), Unique Secure Endpoint (HAProxy), Gateway (Go App), Message Queue (Kafka), Intel Analytics Toolkit for Hadoop*, Data Store (HDFS), Model API (Re-admission), Visualization & Dashboard App.
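A recipe built from small, individually customizable stages can be sketched as an ordered list of functions. This is not the TAP re-admission recipe itself: the stage names, the required fields and especially the logistic weights are made up for illustration, and a real model would be trained on data.

```python
import json
import math

def gateway(raw: str) -> dict:
    """Gateway stage: parse an incoming JSON message and check required fields."""
    record = json.loads(raw)
    missing = {"patient_id", "prior_admissions", "length_of_stay"} - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

def score(record: dict) -> dict:
    """Model API stage: toy logistic score; the weights are invented."""
    z = -2.0 + 0.5 * record["prior_admissions"] + 0.1 * record["length_of_stay"]
    return {**record, "risk": 1.0 / (1.0 + math.exp(-z))}

def flag(record: dict, threshold: float = 0.5) -> dict:
    """Dashboard stage: mark records whose risk exceeds the threshold."""
    return {**record, "high_risk": record["risk"] > threshold}

def run_recipe(stages, raw_messages):
    """Feed each message through the stages in order and collect the outputs."""
    outputs = []
    for raw in raw_messages:
        value = raw
        for stage in stages:
            value = stage(value)
        outputs.append(value)
    return outputs

messages = [
    json.dumps({"patient_id": "p1", "prior_admissions": 6, "length_of_stay": 10}),
    json.dumps({"patient_id": "p2", "prior_admissions": 0, "length_of_stay": 5}),
]
default_run = run_recipe([gateway, score, flag], messages)
# Customizing one stage: a stricter threshold, with the other stages unchanged.
strict_run = run_recipe([gateway, score, lambda r: flag(r, threshold=0.9)], messages)
```

Swapping a single function in the stage list changes the behavior of that step only, which mirrors how a developer would customize one microservice of a recipe while reusing the rest.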

TAP Adoption and Contributions

TAP has been released as an open source project, including many popular and reliable components supported by a large community. The project, its code and documentation, including a Getting Started Guide with beginning steps for each user, can be found at http://trustedanalytics.github.io. To learn how a growing number of organizations support, use and contribute to the TAP project to accelerate their Big Data Analytics application development across a broad range of use cases, visit http://.