Turn Big Data to Small Data

Use Qlik to Utilize Distributed Systems and Document Databases
October 2014
Stig Magne Henriksen

From Big Data to Small Data
Image: kdnuggets.com

Agenda
- When do we have a Big Data problem?
- How to use Qlik for analyzing Big Data by breaking it into small chunks of data that can be analyzed
- General strategies for handling Big Data
- Discussion on how to handle distributed systems
- Discussion on how Qlik can read data from the MongoDB database
- Discussion on how to use Qlik to read data from Hortonworks
- Summary

When do we have a Big Data problem?
- What happens when the amount of data is so huge that it is not possible to store it in a database, nor in the memory of a computer? Too many bytes (volume)
- What happens when the data changes so frequently (new sources) that a solution created a couple of weeks ago is out of date today? Too many sources (variety)

When do we have a Big Data problem? (II)
- What happens when data from the Internet of Things and mobile apps increases immensely? Too high a rate (velocity)
- What happens when a company has 200 branches, each using a different variation of a nearly identical Excel spreadsheet? Non-scalable analysis

Have to Find the Useful Information
Image: thestoragealchemist.com

Why Qlik with Big Data?
- Flexible deployment models
  - In-memory, using ODBC or OLE DB
  - Direct Discovery
  - Application (document) chaining
- Combine Big Data and traditional data sources
(Diagram labels: In-Memory, Direct Discovery, Hybrid)

Qlik In-Memory Approach
- Loads compressed data into memory
- Enables associative search and analysis
- Handles hundreds of millions to billions of rows of data

Qlik Direct Discovery Approach
Combines the associative capabilities of the Qlik in-memory dataset with a query model where:
- The aggregated query result is passed back to a Qlik object without being loaded into the Qlik data model
- The result set is still part of the associative experience
- Capability to drill to detail records
(Diagram labels: Qlik in-memory data model, batch load, Qlik application, Direct Discovery)
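The difference between a batch load and an aggregated query can be sketched in Python, with the standard-library sqlite3 module standing in for the external source. This illustrates the idea only and is not how Qlik implements Direct Discovery; the `sales` table, its columns, and the function names are hypothetical.

```python
import sqlite3

def build_source():
    """Create a stand-in for the external source (hypothetical `sales` table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = [("North", 100), ("North", 250), ("South", 75), ("South", 25)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def in_memory_totals(conn):
    """Batch-load style: fetch every row, then aggregate on the client."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0) + amount
    return totals

def direct_totals(conn):
    """Direct-query style: the source aggregates; only the result comes back."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))

source = build_source()
print(in_memory_totals(source))
print(direct_totals(source))
```

Both functions give the same totals. The second one transfers one row per region instead of every detail record, which is why aggregated queries suit very large sources.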

Application (Document) Chaining
- Navigate among Qlik applications
- Maintain selections / context
1) User makes selections in Application 1
2) User clicks a button to chain to the next application
3) Application 2 opens; selections are transferred and applied

Distributed systems: why use them?
Advantages of distributed computing platforms
- Parallelize I/O to quickly scan large datasets
Cost efficiency
- Commodity nodes (cheap but unreliable)
- Commodity network (might have low bandwidth)
- Automatic fault tolerance (few admins)
- Easier to use (fewer programmers)
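As a rough single-machine analogy for parallel scanning, the sketch below splits a dataset into partitions, scans each in its own worker, and combines the partial results. The record layout and function names are hypothetical, and a real distributed platform spreads the partitions across nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition):
    """Scan one partition and return a partial count of matching rows."""
    return sum(1 for row in partition if row["status"] == "error")

def parallel_scan(partitions, workers=4):
    """Scan all partitions concurrently and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_partition, partitions))

partitions = [
    [{"status": "ok"}, {"status": "error"}],
    [{"status": "error"}, {"status": "error"}],
    [{"status": "ok"}],
]
total_errors = parallel_scan(partitions)
print(total_errors)
```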

Two different approaches - Hortonworks
Direct Discovery
- Can access data from external sources in Qlik
- Does not load data until the app requests it
- Only metadata is loaded
- Real-time load of data
- Can access several tables
ODBC and in-memory
- Access to Hortonworks
- Can read complex objects
- Uses the Hive interface: SQL is sent to Hive, which translates it into MapReduce jobs
- Utilizes the power of several servers
- Result is sent back to Qlik
- No need to define a database view
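To make the SQL-to-MapReduce step more concrete, here is a conceptual Python sketch of how a query such as `SELECT branch, SUM(amount) ... GROUP BY branch` maps onto map, shuffle, and reduce phases. It runs in a single process and is not what Hive actually generates; the record fields and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    return (record["branch"], record["amount"])

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result row."""
    return (key, sum(values))

def run_group_by_sum(partitions):
    """Map every record in every partition, then shuffle and reduce."""
    pairs = [map_phase(r) for part in partitions for r in part]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Each inner list stands for the data held by one server.
partitions = [
    [{"branch": "Oslo", "amount": 10}, {"branch": "Bergen", "amount": 5}],
    [{"branch": "Oslo", "amount": 7}],
]
result = run_group_by_sum(partitions)
print(result)
```

On a cluster the map phase runs on the servers that hold the data, so only the grouped pairs move across the network.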

Qlik and Hortonworks
In-memory: hundreds of millions of rows loaded into memory
- Broad application to discover new trends
- Aggregates / detail
- Deep application to confirm and take action
Direct Discovery: billions of rows via Direct Discovery
- Broad application to discover new trends
- Deep application to confirm and take action

Results from working with Hortonworks
- Easy to set up
- ODBC connects fine; reading data is straightforward
- Can make qualified calls via ODBC (SQL-based calls)
- Direct Discovery works best at an aggregated level
- Hive is by definition not suited for interactive loads with many queries, so be careful with frequent Direct Discovery calls

MongoDB
- New programming model: object-oriented programming
- A document database
- Simple and fast to implement
- No complicated SQL (NoSQL)
- Can be much faster than traditional SQL databases

MongoDB II
- Can spread the database across multiple machines
- Limited multi-record transactional consistency, hence easier to implement across different machines
- Often used in web applications
- Back end for mobile apps

Two different approaches - MongoDB
Direct Discovery
- Can access data from external sources in Qlik
- Does not load data until the app requests it
- Only metadata is loaded
- Real-time load of data
Simba ODBC
- Access to MongoDB
- Can read documents from the database
- Can read complex objects from a document
- Can read sublevels of each instance in the collections
- Can use SQL, although this is a NoSQL database
- Result is sent back to Qlik
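Reading sublevels of a document through a SQL interface implies some mapping from nested documents to flat rows. The Python sketch below shows one plausible mapping, using dotted column names and one row per embedded array element. It is an assumption for illustration, not the Simba driver's actual schema mapping, and the `order` document and function names are hypothetical.

```python
def flatten(document, prefix=""):
    """Turn nested sub-documents into dotted column names."""
    row = {}
    for key, value in document.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row

def explode(document, array_field):
    """Produce one flat row per element of an embedded array."""
    parent = {k: v for k, v in document.items() if k != array_field}
    rows = []
    for item in document.get(array_field, []):
        merged = dict(parent)
        merged[array_field] = item
        rows.append(flatten(merged))
    return rows

order = {
    "_id": 1,
    "customer": {"name": "Ada", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
}
rows = explode(order, "items")
for row in rows:
    print(row)
```

Each resulting row repeats the parent fields, so a SQL-based tool can treat the embedded array like a joined detail table.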

Qlik and MongoDB
(Diagram labels: Simba ODBC; broad application to discover new trends; MongoDB direct reads; deep application to confirm and take action)

Results from working with MongoDB
- Easy to set up
- ODBC connects fine; reading data is straightforward
- Can make qualified calls via ODBC (SQL-based calls, although this is a NoSQL database)
- Can read complex documents and read data at different levels
- Data retrieval is fast

Summary
- Qlik is well suited for tapping into the document database MongoDB, reading its data, and integrating it into existing analyses
- Use different strategies according to your needs:
  - Direct Discovery when reading aggregated data
  - ODBC to read data at a more detailed level
  - Application chaining to switch between different levels of data
- The ODBC and in-memory approach works best with Hortonworks; Hive is too slow for interactive access approaches
- Big Data needs strong visualization tools; in this context, Qlik is well suited for the task

Small or Big Data - Result
Image: beautifulinsanity.com

Thank You