The Rembrandt Group Strategies for BIG DATA 2015-2016



Similar documents
5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

NoSQL for SQL Professionals William McKnight

Data Virtualization for Agile Business Intelligence Systems and Virtual MDM. To View This Presentation as a Video Click Here

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

The Future of Data Management

Oracle Big Data SQL Technical Update

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

The Next Wave of Data Management. Is Big Data The New Normal?

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Native Connectivity to Big Data Sources in MSTR 10

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Big Data and Analytics in Government

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

VIEWPOINT. High Performance Analytics. Industry Context and Trends

Getting Started Practical Input For Your Roadmap

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Increase Agility and Reduce Costs with a Logical Data Warehouse. February 2014

Big Data Integration: A Buyer's Guide

Tap into Hadoop and Other No SQL Sources

The Future of Data Management with Hadoop and the Enterprise Data Hub

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

How To Make Sense Of Data With Altilia

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

An Oracle White Paper October Oracle: Big Data for the Enterprise

Bringing Big Data into the Enterprise

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Agile Business Intelligence Data Lake Architecture

How To Handle Big Data With A Data Scientist

Real Time Big Data Processing

MarkLogic Enterprise Data Layer

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Are You Ready for Big Data?

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

The 3 questions to ask yourself about BIG DATA

Big Data Technologies Compared June 2014

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

Data Integration Checklist

INTELLIGENT BUSINESS STRATEGIES WHITE PAPER

BIG DATA TRENDS AND TECHNOLOGIES

Big Data for Investment Research Management

Search and Real-Time Analytics on Big Data

HDP Hadoop From concept to deployment.

Choosing The Right Big Data Tools For The Job A Polyglot Approach

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Data Modeling for Big Data

INTRODUCTION TO CASSANDRA

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

USING BIG DATA FOR INTELLIGENT BUSINESSES

Making Sense of Big Data in Insurance

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

So What s the Big Deal?

Luncheon Webinar Series May 13, 2013

Transforming the Telecoms Business using Big Data and Analytics

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Ganzheitliches Datenmanagement

How the oil and gas industry can gain value from Big Data?

Data Refinery with Big Data Aspects

HDP Enabling the Modern Data Architecture

Dell Information Management solutions

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Are You Ready for Big Data?

Introduction to Oracle Business Intelligence Standard Edition One. Mike Donohue Senior Manager, Product Management Oracle Business Intelligence

This Symposium brought to you by

Data Virtualization A Potential Antidote for Big Data Growing Pains

SAP BusinessObjects Business Intelligence 4.1 One Strategy for Enterprise BI. May 2013

Big Data Solutions. Portal Development with MongoDB and Liferay. Solutions

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Ali Eghlima Ph.D Director of Bioinformatics. A Bioinformatics Research & Consulting Group

How To Make Data Streaming A Real Time Intelligence

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Next-Generation Cloud Analytics with Amazon Redshift

Microsoft Big Data Solutions. Anar Taghiyev P-TSP

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Business Analytics In a Big Data World Ted Malone Solutions Architect Data Platform and Cloud Microsoft Federal

Big Data Analytics Nokia

An Oracle White Paper June Oracle: Big Data for the Enterprise

How To Use Big Data For Business

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

The big data business model: opportunity and key success factors

Databricks. A Primer

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise

Databricks. A Primer

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Oracle Big Data Discovery Unlock Potential in Big Data Reservoir

Transcription:

The Rembrandt Group Strategies for BIG DATA 2015-2016

Big Data Interesting applications are data hungry Increased number & variety of sources Realization that delete is not an option The data grows over time & is immobile 100 TB @ 1Gbps ~= 10 days Compute needs to go the data Dramatic decline of hardware cost Big Data clusters are the new libraries (J. Campbell, et al., Intel Research Pittsburgh, 2007) Source: Open Cirrus Project

Big Data Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics McKinsley Global Institute Analysis 1 zettabyte? = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes The traditional ways of managing data fall apart at this scale!

Big Data Problems Government Command and Control Fraud detection and cyber security. Compliance and regulatory analysis. Content Management Content Archiving Data Protection Record Management Multi-Source Intelligence Analysis Weather Prediction Energy consumption and carbon footprint management. Infectious Disease Detection Crime Prevention Traffic Management Complex Event Analysis and Decision Support E-Discovery Health & Life Sciences Fraud Detection Campaign and sales program optimization. Brand management. Patient care quality and program analysis. Supply-chain management. Drug discovery and development analysis. Telecommunications Network performance and optimization. Revenue assurance and price optimization. Customer churn prevention. Campaign management and customer loyalty. Call Detail Record (CDR) analysis. The big data problems span industries and domains and will only get worse

Dimensions of the Problem Four Dimensions Drive the Need for Big Data Solutions Volume Variety Velocity Variability Emerging NoSQL Traditional RDBMS Source: Forester Research, Hadoop World 2011 Presentation

Two Universes are the New Reality Source: DeWitt, Nehme, Microsoft Jim Gray Systems Lab Structured & Unstructured Each solution will have its place in the enterprise Relational DB Systems Structured data w/ known schema ACID Transactions SQL Rigid Consistency Model ETL Longer time to insight Maturity, stability, efficiency NoSQL Systems (Un)(Semi)structured data w/o schema No ACID No transactions No SQL Eventual consistency No ETL Faster time to insight Flexibility 6

Why NoSQL? SQL: Data Arrives 1 Cleanse the data Transform the data Load the data 3 4 5 Derive a schema 2 RDBMS SQL Queries 6 NoSQL: Data Arrives No cleansing! No ETL! No load! NOSQL System Analyze data where it lands! Application Program 1 2 Source: DeWitt, Nehme, Microsoft Jim Gray Systems Lab 7

It all started with Google 2003 MR/GFS 2006 Hadoop Massive amounts of click stream data that had to be stored and analyzed Requirements: Scalable to PBs and 1000s of nodes Highly fault tolerant Simple to program against Store Hadoop = HDFS + MapReduce Process Store Process GFS + MapReduce distributed $ fault-tolerant new programming paradigm Source: DeWitt, Nehme, Microsoft Jim Gray Systems Lab

Hadoop A storage & processing framework for Big Data Open Source version of Google s work Provides scalability and a high degree of fault tolerance Ability to quickly analyze massive collections of records without forcing data to first be modeled, cleansed, and loaded Easy to use programming paradigm for writing and executing analysis programs that scale to 1000s of nodes and PBs of data Low, up front software and hardware costs Think data warehousing for Big Data It does not address the Data integration problem

The Data Integration Conundrum Situation we have: A data-model provides the interface to the data AND the Semantics of the data For both structured & unstructured data to be integrated it must have a Unified interface Unified Interface Single data-model + Single perspective But there is no single right way to represent all Knowledge There can be no single data-model for all information Situation we want: A unified interface to All the data that Does NOT impose a specific Data-model UNCLASSIFIED // FOUO // Institute for Modern Intelligence Proprietary 10

A Big Data Architecture Domain Based and Pre-Packaged Applications; Can be COTS, GOTS or CUSTOM Middleware Layer that provides standard, shared application & data services; Provides Platform as a Service for Applications RemIT Analytics & Visualization Tools Open Source Application Frameworks Public API Public API Data & Processing Syndicate Analytics Data & Processing Syndicate Framework UDS Hadoop Distribution (Accumulo, HDFS) Infrastructure Layer that provides basic network services, capacity provisioning and system management services; Provides Infrastructure as a Service for Platforms Open Stack Enterprise Services Infrastructure Cloud Open Compute-Based Hardware

Big Data Core Components Visualization EXPLORE Visualize the data. Predictive dashboards, reports and insights 3 Steps of BigData 3 Business Intelligence EXTRACT Transform and apply 2 analysis, enrich, reveal meaningful and predictive patterns and extract insights Infrastructure Hadoop, Data Marts, Data warehouse, NoSql IMPORT 1 Ingest large amount of Structured and Unstructured Data 12

RemIT Big Data Toolkit Included with Data & Processing Syndicate (DPS) Syndicate Search Engine using Solr Data-Space NoteBook Data-Space Evolved Video Application Syndicate C2 Sand Table Additional RemIT Data Analytics & Visualization Data Ingest and Normalization Text Exploitation w/ NLP CloudBase Geospatial Analysis Visualization and Mapping Widgets

Use Case : E-Discovery Problem Large volume of unstructured and semistructured data exists, often in disjoint systems Traditional Text Analysis tools cannot handle petabytes and Exabytes of data Solution Big Data Analytics tools with high precision document indexing, search, and retrieval capabilities Geospatial Analytics and Visualizations Semantic and Predictive Analytics Benefits Cheap Commodity Hardware Open Source technologies with no vendor lock-in Ability to scale storage and computation as demand grows Hadoop will be the nucleus of the next-generation EDW in the cloud. Forrester 14

Use Case : E-Discovery 3 Explore Visualize the data Advanced analytics and visualizations to find trends, anomalies, and new relationships 2 Extract Transform and apply analysis, enrich and extract insights Create index of feed information, Entity extraction and location information Data Processing System, Lucene/Solr Search, NLP 1 Import Ingest data from various sources Consume vast amounts Twitter feeds, 15

Why Us? RemIT s solution is an Open Source platform with our domain based tools Consists of the Open source Hadoop distribution, Open source Data & Processing Syndicate (DPS) from the Intel community, and RemIT s own analytics & visualization capabilities Ingests, processes & stores extremely large sets of structured and unstructured data leveraging the Data & Processing Syndicate (DPS) Provides an ecosystem around the data services that includes an application services, analytics, GUI s and interfaces. The Data & Processing Syndicate is a discriminating approach big data Goes beyond the basic big data platform Solves fundamental data integration problems An open source platform, in the public domain, and is free to use Has been designed to be simple, fast and applicable to any domain Designed and Implemented in the Intel community with demonstrable benefit Low Cost, Flexible Approach to Big Data

Summary Big Data is not another fad Every domain will need to extract knowledge from the growing mass of unstructured and structured data There are new tools that can manage the explosion of data in the enterprise RemIT offers a unique Open Source approach We enable All Data, Information, and Knowledge regardless of Modality, Structure, or Semantics to be Unified within One distributed Data-Space Low Cost, Flexible and Domain Oriented solution Many deployment options to get started today

Backup Slides

Rembrandt Group

Big Data Core Components 3 Steps of BigData Visualization EXPLORE Visualize the data. Predictive dashboards, reports and insights 3 Business Intelligence EXTRACT Transform and apply 2 analysis, enrich, reveal meaningful and predictive patterns and extract insights Infrastructure Hadoop, Data Marts, Data warehouse, NoSql IMPORT 1 Ingest large amount of Structured and Unstructured Data 20

Big Data Architecture 3 Explore BIG DATA ANALYTICS BI Analyst Visualize the data Tools Analytics Apps Traditional BI tools 2 Extract Transform and apply analysis, enrich and extract insights IT Analyst Data Analyst Business Intelligence (Process Data, Semantic and Predictive Analysis) 1 Unstructured Data BIG DATA Import Ingest data from various sources Structured Data Map HDFS Reduce Hadoop NoSQL DB Data Marts Traditional Data Warehouse ( Eg Teradata, Oracle) 21

Data & Processing Syndicate A Big Data Data Management Solution from the Intel Space Enables All Data, Information, and Knowledge regardless of Modality, Structure, or Semantics to be Unified within One distributed Data-Space Allows the Diversity of Exploration, Integration, Enrichment, and Exploitation Processes to be Developed, Tested, and put into Production Together Provides a Solution to the Data Integration Problem through consideration of data models at a higher level of abstraction Distill a minimal set of elements sufficient to capture ANY data model UNCLASSIFIED // FOUO // Institute for Modern Intelligence Proprietary 22

Data & Processing Syndicate UNCLASSIFIED // FOUO // Institute for Modern Intelligence Proprietary 23

RemIT Data Analytics & Visualization RemIT Data Ingest and Normalization Unstructured data analysis and extraction of metadata Categorization, clustering, and taxonomy generation Entity and attribute extraction and normalization Relationship and event coreference Detect and extract text content from documents HTML, XML, RTF, EPF MS Office & OpenDoc Zip, gzip, and tar Outlook & email archives JPEG and Flash Video Language detection 24

RemIT Data Analytics & Visualization RemIT Text Exploitation Conceptual categorization of objects Parametric based indexing and retrieval Federated retrieval capability Generation of taxonomy information Implements text zoning; entity, event, & relationship identification; and document modeling Utilizes UIMA, the industry standard, to construct analytic pipelines, implement design patterns, & data model representations Natural Language Processing (NLP) tasks include tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution 25

RemIT Data Analytics & Visualization RemIT Geospatial Analysis Big Data Geospatial Indexing, Filtering, Query, store, index, query, analyze, and deliver geospatial raster data Hadoop Spatial Query Engine Spatial indexing system based on space-filling curves (Hilbert) Bulk-loading R*-trees with three-dimensional space-filling curves (AoA k-d-tree) Hadoop CloudBase Spatial Engine AOI queries, spatial joins, and spatial analysis operations implemented as MapReduce CloudBase Spatial Query Extension z Feature queries performed via CloudBase query language to ease complexity Spatial Query Hadoop Engine. HDFS 26

RemIT Data Analytics & Visualization RemIT Visualization Toolkit Integrate and enhance analytics as a set of widgets Spatially visualize analytic results using lightweight modular OZONE Widget Framework (OWF) widgets Expand discovery and search filtering to include points of interest or an area of responsibility bounding box approach Map and layer controls allow operator to easily navigate and display content User interaction to select content and display in other widgets via inter-widget communication Results Search Document Communication Channels Analytics Map Charts Communication Channels 27

Why the Data & Processing Syndicate? Up to the platform level they're all pretty much the same All utilize "NoSQL" technologies that turn commodity servers into a distributed database and parallel computation engine They may employ different flavors of "NoSQL" technologies (HBase, MongoDB, Cassandra,...) We use Cloudbase (NSA GFE) because it provides MLS data security The big difference comes in how data is organized and processed NoSQL datastores do not use schemas like relational data bases do They simply store data as key-value pairs It's up to you to figure out how to represent your domain data-model within that structure Other DoD & Intel clouds shape the data to suit their analytics Traditional ETL processes transform the data according to some domain-specific data-model Data from disparate sources, even within the same cloud are neither unified nor integrated except by traditional data-model harmonization Which do not scale and often distort / lose information The Data & Processing Syndicate breaks the data barriers Provides a unified interface to all data Allows all data to be discovered, explored, searched, enriched, managed, and exploited Semantics provide meaning, not boundaries The Data & Processing Syndicate Makes New Things Possible Any number of semantic perspectives may be supported Data can be continuously enriched and utilized within and across multiple contexts Applications can mash-up any data / information / knowledge without data transformations / data-model harmonization Domain-specific data-models can be harmonized in new ways

Our Products & Services Big Data Integration Big Data Analytics Big Data SaaS Big Data Appliance

Products & Services Big Data Integration Integration Service Offering 1 Month Base Service Install Hadoop Install Data and Processing Syndacate Load a Customer Data sets Demonstrate Ingest, Search & Extract Basic 2 day technical training Customer provides Infrastructure RemIT Analytics & Visualization Tools Open Source Application Frameworks Public API DPS Hadoop Distribution (CloudBase, HDFS) Public API Open Stack Enterprise Services Infrastructure Cloud Data & Processing Syndicate Analytics Data & Processing Syndicate Framework Open Compute-Based Hardware

Products & Services Big Data Analytics Software Based Offering Entire RemIT Visualization, Interface & Analytic Catalog Includes Open Source Application Frameworks One price for everything plus a yearly maintenance fee for SW updates & support Customer must provide infrastructure and Hadoop based platform 2 day training included in price RemIT Analytics & Visualization Tools Open Source Application Frameworks Public API DPS Hadoop Distribution (CloudBase, HDFS) Public API Open Stack Enterprise Services Infrastructure Cloud Data & Processing Syndicate Analytics Data & Processing Syndicate Framework Open Compute-Based Hardware

Products & Services Big Data Appliance Product Based Offering Provide Entire RemIT Visualization, Interface & Analytic Catalog Hadoop Distribution and IMI s Data Management Services The Open Stack distribution for cluster management RemIT pre-integrated infrastructure services 2 day training, hardware warranty & software maintenance included in the price RemIT Analytics & Visualization Tools Open Source Application Frameworks Public API DPS Hadoop Distribution (CloudBase, HDFS) Public API Open Stack Enterprise Services Infrastructure Cloud Open Compute-Based Hardware Data & Processing Syndicate Analytics Data & Processing Syndicate Framework

Products & Services Big Data SaaS Service Based Offering Big Data Services executed at RemIT datacenters Simple per GB price for all services Data loaded by customer into RemIT cloud Services would be accessible via RemIT portal Customer can leave at anytime 2 Day training services RemIT Analytics & Visualization Tools Open Source Application Frameworks Public API DPS Hadoop Distribution Cyber Forensics (CloudBase, HDFS) E-Discovery Public API Open Stack Enterprise Services Infrastructure Cloud Open Compute-Based Hardware Data & Processing Syndicate Analytics Services Compliance Content Management Content Archiving Data Protection Data & Processing Syndicate Framework Record Management