Operates more like a search engine than a database
- Scoring and ranking IP allows for fuzzy searching
- Best-result candidate sets returned
- Contextual analytics to correctly disambiguate entities

Embedded inside the database
- No need for Hadoop or custom-code analytics
- True real-time analytics done per transaction and in aggregate
- On-the-fly linking IP

A new kind of in-memory platform, built for in-memory applications
- Proprietary compression enables in-memory at scale
- Datasets reduced to 16% of original size
- Single-record decompression
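The single-record decompression point can be illustrated with a toy sketch: compressing each record independently lets one record be inflated without touching the rest of the dataset. The use of zlib here is purely illustrative; FinchDB's compression is proprietary and this is not its API.

```python
import zlib

# Illustrative only: compress each record on its own so that any single
# record can be decompressed without inflating the whole dataset.
records = [f"document {i}: " + "payload " * 20 for i in range(1000)]
store = [zlib.compress(r.encode()) for r in records]

# Decompress just record 42, leaving the other 999 compressed.
one = zlib.decompress(store[42]).decode()
print(one.startswith("document 42"))
```

Because repetitive text compresses well, each stored record is also much smaller than its original, which is the same trade the slide describes at dataset scale.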
- 1M documents to petabyte scale; streaming, constantly changing data, or more of the same type of data
- Questions are unique to users; analytics driven by the information that comes through on the query
- Looking for the best answer, not a definitive one; consider how, if, and to what extent data changes
- Need flexibility in query formation and fuzzy search; the DBMS must perform like a search engine as well as a database
- FinchDB reduces datasets to as little as 16% of their original size
- Need sub-second response times, enabling analytics per transaction; need embedded models
- Need storage costs reduced; must run on commodity hardware
- As in HTAP environments, among others
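A minimal sketch of the fuzzy-search behavior described above, using Python's difflib purely for illustration (this is not FinchDB's matching algorithm): near-matches are scored and returned as a ranked candidate set rather than requiring an exact key, which is how a search engine behaves and a traditional DBMS does not.

```python
from difflib import SequenceMatcher

# Illustrative record set; a real deployment would hold millions of entries.
records = ["Jon Smith", "John Smyth", "Joan Smithe", "Jane Doe"]

def fuzzy_candidates(query, names, threshold=0.6):
    """Score every name against the query and return a ranked candidate set."""
    scored = [(SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
              for n in names]
    return sorted([(s, n) for s, n in scored if s >= threshold], reverse=True)

print(fuzzy_candidates("John Smith", records))
```

Note that "Jon Smith" ranks highly even though it would miss entirely on an exact-match lookup, while unrelated names fall below the threshold.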
- Fraud Detection: Monitoring financial transactions to identify patterns that could indicate fraud
- Internet of Things: Collecting high-volume, high-velocity sensor and telemetry data to improve performance, meet customer needs, or support new product development
- Digital Communication/Message Traffic: Monitoring streaming feeds of message traffic to identify patterns, risks, and trends
- CRM/Customer Service Engagement: Aggregating customer information from multiple sources with different data models to improve the customer experience
- Personalization: Ingesting clickstream data at high throughput rates to create and refine visitor profiles, serving up relevant content upon each return site visit
- Real-Time Big Data: Ingesting a streaming feed of data to perform real-time analytics that inform business-critical decisions
- Cyber Security: Protecting data from breaches, theft, or misuse
- Legal Intelligence: Mining legal documents (docket data, filings, etc.) to identify and disambiguate entities
[Diagram: query/answer flow through four system types: SQL database management system, in-memory SQL database management system, NoSQL database management system, and in-memory NoSQL database management system]
[Diagram: a query returns a candidate set; the best answer is derived from analytic processing, with optional aggregate analytics]

- Compression IP: makes in-memory feasible at scale
- On-the-Fly Linking IP: enables true real-time analytics inside FinchDB
- Scoring & Ranking IP: means it acts more like a search engine than a DBMS
[Diagram: analytics outside the database. Batch processing looks up known, precomputed information: a query hits a DBMS of static data*, and custom code refines the initial answer.]

*Static data: predetermined answers to predetermined questions about things you know you want to know
Search today (HP Autonomy, Solr, and even commercial search engines): a query returns a candidate set, then ranked results.
- Not in-memory, but FinchDB is.
- No analytics, but FinchDB does them.
- Primarily text-oriented, but FinchDB handles text and numeric data.
A question we often encounter is how FinchDB handles streaming data in addition to static data, and how it differs from the popular Apache Spark product. The primary difference is our ability to apply transactional, predictive analytics on the fly, inside the database, using all available data. Below is a side-by-side comparison.

[Diagram: each event produces an answer plus analytics. Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html]

- Models inside the database
- Apply predictive models
- Analyze on the fly
- Compute answers
- Go beyond look-up
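The "compute answers, go beyond look-up" idea can be sketched as follows. The toy fraud model, the field names, and the 0.8 threshold are all illustrative assumptions, not FinchDB's API: the point is only that each event is scored against a model held alongside the data at the moment it arrives, rather than answered from a precomputed table.

```python
# Illustrative per-event analytics: the model lives with the data, and
# every incoming event is scored (not merely looked up) as it arrives.
model_weights = {"amount": 0.7, "velocity": 0.3}  # toy fraud model

def on_event(event):
    # Compute a risk score on the fly, per transaction.
    score = sum(model_weights[k] * event[k] for k in model_weights)
    return {"event": event, "risk": score, "flag": score > 0.8}

print(on_event({"amount": 0.9, "velocity": 0.6}))
```

A batch pipeline would instead ship events out to an external job and return a precomputed answer later; here the answer and the analytics are produced in the same step.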
[Diagram: content sources (wires, original content, corporate blogs, online media) feed entity extraction and stream processing, which drive KB inserts and queries]

PROPRIETARY & CONFIDENTIAL
- Running on a four-node cluster in AWS
- Processing a streaming feed of news with 800,000 documents per day
- Disambiguating roughly 10 entities per document
- Leveraging a Person-KB of 500M features describing 3M unique people
- A Geo-KB with more than 30M unique places in the world
- An Org-KB of more than 380M features describing more than 1.3 million unique companies, non-profits, governments, and criminal organizations

Zabbix metrics from Thursday, July 30, approx. 8:30-10:30 a.m. ET
At peak times, 70,000+ disambiguation queries per 5-minute window. That's 233 queries per second.
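The quoted rates can be sanity-checked with simple arithmetic from the numbers above:

```python
# Peak rate: 70,000 queries per 5-minute window.
queries_per_window = 70_000
window_seconds = 5 * 60
peak_qps = queries_per_window / window_seconds  # about 233

# Implied daily average: 800,000 documents/day x ~10 entities each.
avg_qps = 800_000 * 10 / 86_400  # about 93

print(round(peak_qps), round(avg_qps))
```

So the peak is roughly 2.5x the day-long average implied by the feed volume, which is consistent with a bursty news stream.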
Average response time is 0.9 milliseconds.
Even at its peak, FinchDB is using just 12% of CPU capacity (on one node). During this window, CPU utilization averages around 2%.
Every query has search specifications and scoring/ranking specifications; we look at both to return a candidate set. In an entity disambiguation use case, we then calculate a disambiguation score based on:
- Name Score
- Topic Vector Score
- Context Vector Score
- Prominence Score

[Diagram: a query returns a candidate set; the best answer is derived, with aggregate analytics on the answer]

And we do that in less than a millisecond around every event. In this use case, an event is a new document coming into the system. The same would be true in other use cases: in a cybersecurity use case, an event would be an attack. In that scenario, you could take what's happening in your environment and include that data as part of the query.
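A minimal sketch of the ranking step, with illustrative weights and candidate scores (these are assumptions for the example, not FinchDB internals): each candidate's four component scores are combined into a single disambiguation score, and the best-scoring candidate in the set is returned.

```python
# Illustrative weights; a real system would tune these per use case.
WEIGHTS = {"name": 0.4, "topic": 0.2, "context": 0.2, "prominence": 0.2}

def disambiguation_score(candidate):
    """Combine the four component scores into one disambiguation score."""
    return sum(WEIGHTS[k] * candidate[k] for k in WEIGHTS)

# Two hypothetical candidates for the same mention of "J. Smith":
# a prominent person with a weak contextual fit, and a less prominent
# person whose topic and context vectors match the document well.
candidates = [
    {"id": "person-1", "name": 0.95, "topic": 0.30, "context": 0.40, "prominence": 0.90},
    {"id": "person-2", "name": 0.95, "topic": 0.85, "context": 0.80, "prominence": 0.20},
]

best = max(candidates, key=disambiguation_score)
print(best["id"])
```

Note how the contextual scores let the less prominent candidate win, which is the disambiguation behavior the slide describes: context in the document, not raw fame, picks the best answer.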
- JSON-style doc database; not in-memory; no embedded analytics; open source
- In-memory; multiple deployment models; distributed architecture; no embedded analytics
- In-memory; HTAP processing use cases; only works on structured data
- In-memory; handles unstructured text; HTAP processing use cases
- GridGain: as a data fabric, GridGain takes in SQL, NoSQL, and Hadoop-analytic data. FinchDB does on-the-fly analytics inside the database, meaning the need for Hadoop could be eliminated altogether. Only works on structured data. Not true in-memory: uses a built-in, on-demand caching scheme; all transactional operations are done on in-memory data.
- Doc database; open source; cannot be cloud-deployed/DBaaS
- JSON-style doc database; distributed architecture; not in-memory; open source