Embedded inside the database. No need for Hadoop or custom code. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP



Transcription:

Operates more like a search engine than a database
- Scoring and ranking IP allows for fuzzy searching
- Best-result candidate sets returned
- Contextual analytics to correctly disambiguate entities

Embedded inside the database
- No need for Hadoop or custom-code analytics
- True real-time analytics done per transaction and in aggregate
- On-the-fly linking IP

A new kind of in-memory platform, built for in-memory applications
- Proprietary compression enables in-memory at scale
- Datasets reduced to 16% of original size
- Single-record decompression
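The idea of fuzzy searching that returns a ranked candidate set, rather than an exact-match lookup, can be illustrated with a small sketch. This is not FinchDB's scoring and ranking IP, which the slides do not describe; it uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names, the `min_score` threshold, and the sample names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for a real scoring model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_set(query: str, names: list[str], limit: int = 3, min_score: float = 0.6):
    """Return up to `limit` (score, name) pairs, best match first.

    `min_score` is a hypothetical cut-off; anything below it is dropped
    instead of being returned as a weak candidate.
    """
    scored = [(similarity(query, name), name) for name in names]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:limit]


# Hypothetical knowledge-base entries.
names = ["Acme Corporation", "Acme Corp.", "Apex Holdings", "Acne Studios"]
results = candidate_set("Acme Corp", names)

for score, name in results:
    print(f"{score:.2f}  {name}")
```

An exact-match lookup for "Acme Corp" would return nothing here; the fuzzy version returns the near matches in ranked order and leaves the final choice to a later scoring step.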

- 1M documents to petabyte scale; streaming, constantly changing data, or more of same type of data
- Questions are unique to users; analytics driven by the information that comes through on the query
- Looking for the best answer, not a definitive one. Consider how/if/to what extent data changes.
- Need flexibility in the query formation and fuzzy search; DBMS must perform like a search engine as well as a database
- Finch = up to 16% of original size
- Need sub-second response times; enabling analytics per transaction. Need embedded models.
- Need storage costs reduced; must run on commodity hardware
- As in HTAP environments, others

- Fraud Detection: Monitoring financial transactions to identify patterns that could indicate fraud
- Internet of Things: Collecting high-volume, high-velocity sensor and telemetry data to improve performance, meet customer needs or support new product development
- Digital Communication/Message Traffic: Monitoring streaming feeds of message traffic to identify patterns, risks, trends
- CRM/Customer Service Engagement: Aggregating customer information from multiple sources with different data models to improve the customer experience
- Personalization: Ingesting clickstream data at high throughput rates to create and refine visitor profiles, serving up relevant content upon each return site visit
- Real-Time Big Data: Ingesting a streaming feed of data to perform real-time analytics that inform business-critical decisions
- Cyber Security: Protecting data from breaches, theft or misuse
- Legal Intelligence: Mining legal documents (docket data, filings, etc.) to identify and disambiguate entities

[Diagram: query/answer flow for four system types: SQL Database Management System, In-Memory SQL Database Management System, NoSQL Database Management System, In-Memory NoSQL Database Management System]

[Diagram labels: Query, Candidate Set, Best Answer (derived from analytic processing), Answer, Aggregate Analytics (optional)]
- Compression IP: Makes in-memory feasible at scale
- On-the-Fly Linking IP: Enables true real-time analytics inside Finch
- Scoring & Ranking IP: Means it acts more like a search engine than a DBMS

Analytics Outside the Database
Batch Processing (Look Up Known, Precomputed Info)
[Diagram labels: Q, A, Custom Code, Initial Answer, DBMS, Static Data*]
*Predetermined answers to predetermined questions about things you know you want to know

Search Today (HP Autonomy, Solr, and even commercial search engines)
[Diagram labels: Query, Candidate Set, Ranked Results]
- Not in-memory. But FinchDB is.
- No analytics. But FinchDB does.
- Primarily text-oriented. FinchDB handles text & numeric data.

A question we often encounter is how FinchDB handles streaming data in addition to static data, and how it differs from the popular Apache Spark product. The primary difference is our ability to apply transactional, predictive analytics on the fly, inside the database, using all available data. Below is a side-by-side comparison.
[Diagram labels: Event, Answer + Analytics]
Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html
- Models inside the database
- Apply predictive models
- Analyze on the fly
- Compute answers
- Go beyond look-up

[Diagram labels: Wires, Original Content, Corporate Blogs, Stream Processing, KB Inserts, Online Media, Entity Extraction, Queries]
PROPRIETARY & CONFIDENTIAL

- Running on a four-node cluster in AWS
- Processing a streaming feed of news with 800,000 documents per day
- Disambiguating roughly 10 entities per document
- Leveraging a Person-KB of 500M features describing 3M unique people
- A Geo-KB with more than 30M unique places in the world
- And an Org-KB of more than 380M features describing more than 1.3 million unique companies, non-profits, governments and criminal organizations
Zabbix metrics from Thursday, July 30 at approx. 8:30-10:30 a.m. ET

At peak times, 70,000+ disambiguation queries per 5-minute window. That's 233 queries per second.
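The per-second figure follows directly from the window size, and it can be set against the average rate implied by the workload described earlier (800,000 documents per day, roughly 10 entities each). The short calculation below is only a sanity check on the quoted numbers; the average assumes documents arrive evenly over 24 hours, which the slides do not state.

```python
# Figures quoted in the slides.
PEAK_QUERIES_PER_WINDOW = 70_000   # "70,000+" per window, so this is a lower bound
WINDOW_SECONDS = 5 * 60            # 5-minute window
DOCS_PER_DAY = 800_000
ENTITIES_PER_DOC = 10              # "roughly 10"

SECONDS_PER_DAY = 24 * 60 * 60

# Peak rate: queries in the busiest window divided by the window length.
peak_qps = PEAK_QUERIES_PER_WINDOW / WINDOW_SECONDS

# Implied average rate, ASSUMING uniform arrival over the whole day.
avg_qps = DOCS_PER_DAY * ENTITIES_PER_DOC / SECONDS_PER_DAY

print(f"peak:    {peak_qps:.0f} queries/second")
print(f"average: {avg_qps:.0f} queries/second (uniform-arrival assumption)")
print(f"peak is about {peak_qps / avg_qps:.1f}x the implied average")
```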

Average response time is 0.9 milliseconds.

Even at its peak, FinchDB is using just 12% of CPU capacity (on one node). During this window, CPU utilization averages around 2%.

Every query has search specifications and scoring/ranking specifications. We look at both to return a candidate set. In an entity disambiguation use case, we do that by calculating a disambiguation score based on:
- Name Score
- Topic Vector Score
- Context Vector Score
- Prominence Score
[Diagram labels: Query, Candidate Set, Best Answer, Answer, Aggregate Analytics]
And we do that in less than a millisecond around every event. In this use case, an event is a new document coming into the system. The same would be true in other use cases. In a cybersecurity use case, an event would be an attack. In this scenario, you could take what's happening in your environment and include that data as part of the query.
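How the four component scores might be combined to pick a best answer from a candidate set can be sketched as follows. The slides name the components but not how they are combined, so the weighted sum, the weight values, the `Candidate` type, and the sample entities below are all assumptions for illustration, not FinchDB's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One entity in the candidate set, with the four component scores in [0, 1]."""
    entity_id: str
    name_score: float
    topic_vector_score: float
    context_vector_score: float
    prominence_score: float


# Hypothetical weights; the source does not say how the components are combined.
WEIGHTS = {
    "name_score": 0.4,
    "topic_vector_score": 0.2,
    "context_vector_score": 0.3,
    "prominence_score": 0.1,
}


def disambiguation_score(candidate: Candidate, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the component scores (an assumed combination rule)."""
    return sum(weight * getattr(candidate, field) for field, weight in weights.items())


def best_answer(candidates: list, weights: dict = WEIGHTS):
    """Return the highest-scoring candidate, or None for an empty candidate set."""
    ranked = sorted(candidates, key=lambda c: disambiguation_score(c, weights), reverse=True)
    return ranked[0] if ranked else None


# Two hypothetical people sharing a name: one is more prominent,
# the other fits the topic and context of the incoming document better.
candidates = [
    Candidate("person:jordan-athlete", 0.95, 0.20, 0.10, 0.90),
    Candidate("person:jordan-professor", 0.90, 0.80, 0.85, 0.30),
]
best = best_answer(candidates)
print(best.entity_id)
```

With these illustrative weights the less prominent candidate wins because the document's topic and context support it, which is the behavior a context-aware disambiguation score is meant to produce.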

- JSON-style, doc database. Not in-memory, no embedded analytics, open-source.
- In-memory, multiple deployment models, distributed architecture. No embedded analytics.
- In-memory, HTAP processing use cases. Only works on structured data.
- In-memory, handles unstructured text, HTAP processing use cases. As a data fabric, GridGain takes in SQL, NoSQL and Hadoop-analytic data. FinchDB does on-the-fly analytics inside the database, meaning the need for Hadoop could be eliminated altogether.
- Only works on structured data. Not true in-memory: uses a built-in, on-demand caching scheme.
- All transactional operations are done on in-memory data.
- Doc database. Open source, cannot be cloud deployed/DBaaS.
- JSON-style, doc database, distributed architecture. Not in-memory, open-source.