How to Navigate Big Data with Ad Hoc Visual Data Discovery
Data technologies are rapidly changing, but principles of 30 years ago still apply today




INTRODUCTION

Data is the heart of TIBCO Spotfire. It's important to understand how to load it, but also how Spotfire consumes and processes it. Many think that step changes in the volume of data being stored and processed, as well as in new technologies, alter the approach to data access. The reality is that the same basic questions need to be asked whenever you consider working with any data source. Let's strip away all the technical jargon and formulate a real-world scenario based on simple physics. You'll see that the principles of 30 years ago still apply today.

BACK TO BASICS

Suppose you have gone back 30 years and are working in an office as a data analyst in a small country. The records you need to analyze are stored downstairs in a series of filing cabinets. They are the medical records of 5 million citizens. Your boss comes along and says he needs to know the average number of times over their entire life that a person visits a doctor. The question is simple, and the answer is a single floating-point number.

One way to find the answer would be to go down to the records department, make a copy of all the records related to doctor visits for every patient, cart those back up to your office, and count them. Another way would be to call down to Records and ask them to scan through the records, compile the result, and send it up to you.
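
The two ways of answering the boss's question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database stands in for the records department, and the table names, column names, sample rows, and both function names are invented for the example. It is not how Spotfire itself is implemented.

```python
import sqlite3

# The "records department": a tiny stand-in database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE visits (patient_id INTEGER, visit_date TEXT)")
conn.executemany("INSERT INTO patients VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "1985-01-04"), (1, "1985-06-11"), (1, "1986-02-20"),
        (2, "1985-03-02"),
        (3, "1985-03-09"), (3, "1986-07-30"),
        # Patient 4 has never visited a doctor but still counts as a person.
    ],
)


def average_by_copying(connection):
    """First method: copy every record upstairs, then count locally."""
    patients = connection.execute("SELECT patient_id FROM patients").fetchall()
    visits = connection.execute("SELECT patient_id, visit_date FROM visits").fetchall()
    return len(visits) / len(patients)


def average_by_asking(connection):
    """Second method: ask the records department for the single number."""
    (average,) = connection.execute(
        "SELECT CAST(COUNT(*) AS REAL) / (SELECT COUNT(*) FROM patients) FROM visits"
    ).fetchone()
    return average


copied = average_by_copying(conn)
asked = average_by_asking(conn)
```

Both functions return the same number. The difference is how much data crosses the "elevator": the first moves every row, the second moves one value.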

What are some initial conclusions we can draw, and what questions remain, for each method?

The first method seems awfully wasteful. You haul all those records up the elevator just to count them, when you could have done that downstairs. However:

- Once you have the records, you can answer other questions without going back downstairs.
- How will you know when new records are available, and how will you get them?
- If the records department closes, you can still work on your copies.
- How will the size and speed of the elevator impact your work?
- How much office space will you need to store and work on the records?
- What security precautions will you need to take for this sensitive information?

The second method seems much more efficient. However:

- How many other requests does the records department have to fill?
- What happens as the questions become more complex, such as requiring segmentation by age and city?
- What happens if you ask the records department to run a Bayesian inference?

IN-MEMORY

In in-memory mode, Spotfire reads all the raw data from a database, file, or system into its own internal memory. It then sorts the data into a format that allows it to do the calculations required for fast and efficient visualizations. This technique is analogous to our first method of accessing the medical records. With this in mind, let's revisit the previous pros and cons:

The in-memory method seems awfully wasteful because you copy all the data across the network.
However:

- Do you want to wait as long as it may take to load the data?
- Once you have the data, you can answer other questions without going back to the original database or system.
- How will you know when new data is available, and how will you get it?
- If you disconnect from the network, you can still work on your copy of the data.
- How will the size and speed of the network impact your work?
- How much memory will you need to store the data and work on it?
- What security precautions will you need for this sensitive information?

One obvious limitation is that you cannot read more data into memory than available capacity allows. On a desktop or laptop computer you might have 4, 8, or 16 gigabytes of memory. On servers using TIBCO Spotfire Web Player or TIBCO Spotfire Automation Services, there may be 32, 64, or more gigabytes of available memory. Spotfire is capable of loading over 100 gigabytes of data, but it will take some time.
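
A back-of-the-envelope check can show whether a data set is a candidate for in-memory mode before you try to load it. The sketch below is only rough arithmetic: the row count, the 80 bytes per row, and the 1.5x headroom factor are made-up assumptions, and the function names are hypothetical. Actual memory use depends on data types, compression, and how the tool stores the data internally.

```python
def estimate_gb(rows, bytes_per_row):
    """Raw size of the data in gigabytes (1 GB taken as 10**9 bytes)."""
    return rows * bytes_per_row / 1e9


def fits_in_memory(rows, bytes_per_row, available_gb, headroom=1.5):
    """True if the data, plus an assumed working headroom, fits in available memory."""
    return estimate_gb(rows, bytes_per_row) * headroom <= available_gb


# Hypothetical data set: 100 million visit rows at roughly 80 bytes each.
size_gb = estimate_gb(100_000_000, 80)
on_laptop = fits_in_memory(100_000_000, 80, available_gb=8)
on_server = fits_in_memory(100_000_000, 80, available_gb=16)
```

With these assumed numbers the raw data is 8 GB, which is too large for an 8 GB laptop once headroom is included but fits on a 16 GB machine.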

IN-DATASOURCE

In in-datasource mode, Spotfire leaves the data being analyzed in the source database or system. It then sends questions to that database or system and uses the results to display data visualizations. This is analogous to our second method of accessing the medical records. Once again, let's revisit the previous pros and cons:

This method seems much more efficient. However:

- How many threads does the database engine have available to process your requests?
- How will performance degrade as the questions become more complex, such as requiring segmentation by age and city?
- What will happen if you ask the database to run a Bayesian inference?

Enterprise-level database systems are often capable of holding petabytes of data and running queries against them quickly and efficiently. However, not all databases are created equal, so it's a good idea to run some test queries before committing to this mode of operation.

Another approach that falls into this mode of data retrieval is to pre-create a set of query answers on a regular basis so that the most common questions can be answered quickly and efficiently. The most common implementation of this approach is called a cube. Cubes are pre-built with specific dimensions and measures and contain pre-computed answers to common questions. Spotfire is able to use cubes to visualize data, provided the dimensions and measures match what you want to visualize.

Spotfire comes with connectors for a wide range of data sources, which allow it to work in-datasource. Connectors are, in most cases, enabled when the corresponding driver has been installed. The driver is needed on the PC for TIBCO Spotfire Analyst, and on the Spotfire Web Player Server for TIBCO Spotfire Consumer and TIBCO Spotfire Business Author. As we have seen, the ability to do calculations depends on the data source doing the job.
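
The cube idea can be illustrated with a small pre-aggregated table. This is a simplified sketch, not a real OLAP cube and not Spotfire's connector API: SQLite stands in for the data source, and the schema, city names, age bands, and the `visits_by` helper are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, city TEXT, age_band TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "Northport", "0-17"), (1, "Northport", "0-17"),
        (2, "Northport", "18-64"),
        (3, "Eastfield", "18-64"), (3, "Eastfield", "18-64"),
        (4, "Eastfield", "65+"),
    ],
)

# Build the "cube" ahead of time: one row per (city, age_band) combination,
# holding the pre-computed measure visit_count.
conn.execute(
    """
    CREATE TABLE visit_cube AS
    SELECT city, age_band, COUNT(*) AS visit_count
    FROM visits
    GROUP BY city, age_band
    """
)

CUBE_DIMENSIONS = ("city", "age_band")


def visits_by(dimension):
    """Answer a question from the cube; only its own dimensions are supported."""
    if dimension not in CUBE_DIMENSIONS:
        raise ValueError(f"the cube was not built with dimension {dimension!r}")
    rows = conn.execute(
        f"SELECT {dimension}, SUM(visit_count) FROM visit_cube "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    ).fetchall()
    return dict(rows)


by_city = visits_by("city")
by_age = visits_by("age_band")
```

Questions along the pre-built dimensions are answered from a handful of aggregate rows. A question the cube was not designed for, such as visits per individual patient, cannot be answered from it at all, which is the trade-off described above.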
Relational data sources have differing capabilities, and cubes, with their pre-calculated measures, are in many ways very different from relational data sources. Spotfire accommodates all these different technologies.

DATA-ON-DEMAND

In addition to the two options already mentioned, Spotfire offers a third option for working with data, a hybrid of the other two. In data-on-demand mode, data is retrieved from the source system when it is needed.

Consider our medical records example. Suppose your initial dashboard visualizations are set up to display the average number of doctor's visits using either in-memory or in-datasource data access. You identify a number of outliers, and now you want to drill into the actual visit notes to see if you can explain your observations. You don't want to bring the text history of all the patients into memory or retrieve it through an in-datasource query, because that would mean a huge amount of extra, irrelevant data. You only want to drill into the outliers. So, you retrieve those extra pieces of data on demand, when you need them. When you're finished reading them, you discard them to free up memory.

This is data-on-demand as a master-detail pattern. It is the most common pattern, but another useful pattern is to ask some questions first and then load the data once all the relevant information is entered. For example, you could determine which medical procedures you want to see data for, and then retrieve only those specific records.
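
The master-detail pattern can be sketched as follows. As before, this is an illustration under assumptions rather than Spotfire's own mechanism: SQLite plays the source system, and the schema, sample notes, outlier threshold, and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, note TEXT)")
sample = [(1, "checkup"), (2, "checkup"), (3, "checkup")]
sample += [(4, f"follow-up {i}") for i in range(1, 7)]
conn.executemany("INSERT INTO visits VALUES (?, ?)", sample)


def load_summary():
    """Master view: one small aggregate row per patient, no note text."""
    return dict(
        conn.execute("SELECT patient_id, COUNT(*) FROM visits GROUP BY patient_id")
    )


def find_outliers(summary, threshold):
    """Patients whose visit count exceeds an assumed threshold."""
    return sorted(pid for pid, count in summary.items() if count > threshold)


def load_notes_on_demand(patient_ids):
    """Detail view: fetch the bulky notes only for the selected patients."""
    if not patient_ids:
        return []
    placeholders = ",".join("?" for _ in patient_ids)
    return conn.execute(
        f"SELECT patient_id, note FROM visits "
        f"WHERE patient_id IN ({placeholders}) ORDER BY rowid",
        list(patient_ids),
    ).fetchall()


summary = load_summary()
outliers = find_outliers(summary, threshold=3)
notes = load_notes_on_demand(outliers)
# When the analyst is done reading, the detail rows can simply be dropped.
```

Only the notes for the selected outliers ever leave the source system; the notes for everyone else stay where they are.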

DATA SOURCES

The data landscape is changing rapidly. Hadoop was a groundbreaker a few years ago and continues to reinvent itself. Other technologies, such as Hive on Tez, are continuously improved. Spark SQL offers fast direct query access and compatibility with Spark. Cloudera Impala is still one of the most popular ways to get fast answers on Hadoop. Vendors such as AtScale are tackling the demand for faster answers with cubes on Hadoop. Facebook's contribution to the Hadoop community, the SQL query engine Presto, has quickly been adopted by other large companies such as Netflix, LinkedIn, and Airbnb, and is now also heavily supported by Teradata.

We also see that traditional database vendors such as Teradata are extending their offerings with Hadoop support. For example, vendors are offering connectors that allow users to query and easily move data between their database and Hadoop.

Last but not least, data is moving to the cloud. Many new companies wouldn't even consider managing their own data storage platforms and are instead choosing offerings like Google BigQuery, Amazon Redshift, and Databricks Cloud. We also see that companies enabled by secure integration between on-premises and cloud data are moving certain use cases to the cloud, such as exposing parts of their data to partners.

[Figure: Three Keys to Data Visualization and Discovery on Big Data. Diagram labels: In-Memory, Advanced Analytics, In-Datasource, Datasource, On-Demand. (1) In-memory architecture moves row-level data to the analytics platform. (2) Analysis of data in a database returns aggregation stats. (3) On-demand capabilities dynamically swap data in and out of memory based on the user's selection.]

CONCLUSION

In summary, the key to visual data discovery on big data is to access data in a combination of different ways at the same time, from the same analysis or dashboard.
But even with faster and more optimized databases and query engines, there is still a lot of data to analyze, and business users want and need fast analyses. Connecting to big data sources with the right combination of in-memory, in-datasource, and on-demand techniques is key.

With Spotfire, besides visualizing big data, advanced analytics such as statistical models are also possible. The statistical engine, TIBCO Enterprise Runtime for R (TERR), enables this. The key to advanced analytics is, again, the combination of in-datasource and in-memory techniques. Often, relevant data is analyzed and computed in-datasource and then brought into Spotfire and enriched with the powerful in-memory expression capabilities of TERR. For more information, see the references below.

REFERENCES

For specific Spotfire connector requirements: http://spotfire.tibco.com/solutions/technology/big-data

For additional details and systems requirements for any of the Spotfire data access capabilities, visit the TIBCO Spotfire Systems Requirements page: http://support.spotfire.com/sr.asp

Global Headquarters
3307 Hillview Avenue
Palo Alto, CA 94304
+1 650-846-1000 TEL
+1 800-420-8450
+1 650-846-1005 FAX
www.tibco.com

TIBCO Software empowers executives, developers, and business users with Fast Data solutions that make the right data available in real time for faster answers, better decisions, and smarter action. Over the past 15 years, thousands of businesses across the globe have relied on TIBCO technology to integrate their applications and ecosystems, analyze their data, and create real-time solutions. Learn how TIBCO turns data, big or small, into differentiation at www.tibco.com.

© 2015, TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, TIBCO Software, and Spotfire are trademarks or registered trademarks of TIBCO Software Inc. or its subsidiaries in the United States and/or other countries. All other product and company names and marks in this document are the property of their respective owners and mentioned for identification purposes only.

11/20/15