How to Navigate Big Data with Ad Hoc Visual Data Discovery

Data technologies are rapidly changing, but principles of 30 years ago still apply today

INTRODUCTION

Data is the heart of TIBCO Spotfire. It's important to understand how to load it, but also how Spotfire consumes and processes it. Many think that step changes in the volume of data being stored and processed, as well as new technologies, alter the approach to data access. The reality is that the same basic questions need to be asked whenever you consider working with any data source. Let's strip away all the technical jargon and formulate a real-world scenario based on simple physics. You'll see that the principles of 30 years ago still apply today.

BACK TO BASICS

Suppose you have gone back 30 years and are working in an office as a data analyst in a small country. The records you need to analyze are stored downstairs in a series of filing cabinets. They are the medical records of 5 million citizens. Your boss comes along and says he needs to know the average number of times over their entire life that a person visits a doctor. The question is simple, and the answer is a single floating point number.

One way to find the answer would be to go down to the records department, make a copy of all the records related to doctor visits for every patient, cart those back up to your office, and count them. Another way would be to call down to Records and ask them to scan through the records, compile the result, and send it up to you.
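The two ways of answering the boss's question can be sketched in a few lines of Python. This is only an illustration, not how Spotfire works internally: the `visits` table, its columns, and the six sample rows are made up, and SQLite stands in for the records department. Note that this sketch averages only over patients who have at least one recorded visit.

```python
import sqlite3

def build_records(conn):
    """Create a tiny stand-in for the filing cabinets (hypothetical schema and data)."""
    conn.execute("CREATE TABLE visits (patient_id INTEGER, visit_date TEXT)")
    rows = [
        (1, "1985-01-03"), (1, "1985-06-11"), (1, "1986-02-20"),
        (2, "1985-03-15"),
        (3, "1985-04-01"), (3, "1986-09-09"),
    ]
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)

def average_by_copying(conn):
    """Method 1: cart every visit record upstairs, then count them locally."""
    all_rows = conn.execute("SELECT patient_id FROM visits").fetchall()
    counts = {}
    for (patient_id,) in all_rows:
        counts[patient_id] = counts.get(patient_id, 0) + 1
    return sum(counts.values()) / len(counts)

def average_by_asking(conn):
    """Method 2: ask Records to compile the result and send back one number."""
    query = """
        SELECT AVG(n) FROM (
            SELECT COUNT(*) AS n FROM visits GROUP BY patient_id
        )
    """
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect(":memory:")
build_records(conn)
print(average_by_copying(conn))  # every row crosses the "elevator"
print(average_by_asking(conn))   # only a single number comes back
```

Both functions return the same average; what differs is how much data has to travel to produce it.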
What are some initial conclusions we can draw and questions that remain for each method?

The first method seems awfully wasteful. You haul all those records up the elevator just to count them, when you could have done that downstairs. However:
- Once you have the records, you can answer other questions without going back downstairs.
- How will you know when new records are available, and how will you get them?
- If the records department closes, you can still work on your copies.
- How will the size and speed of the elevator impact your work?
- How much office space will you need to store and work on the records?
- What security precautions will you need to take for this sensitive information?

The second method seems much more efficient. However:
- How many other requests does the records department have to fill?
- What happens as the questions become more complex, such as requiring segmentation by age and city?
- What happens if you ask the records department to run a Bayesian inference?

IN-MEMORY

In in-memory mode, Spotfire reads all the raw data from a database, file, or system into its own internal memory. It then sorts the data into a format that allows it to do the calculations required for fast and efficient visualizations. This technique is analogous to our first method of accessing the medical records. With this in mind, let's revisit the previous pros and cons:

The in-memory method seems awfully wasteful because you copy all the data across the network.
However:
- Do you want to wait as long as it may take to load the data?
- Once you have the data, you can answer other questions without going back to the original database or system.
- How will you know when new data is available, and how will you get it?
- If you disconnect from the network, you can still work on your copy of the data.
- How will the size and speed of the network impact your work?
- How much memory will you need to store the data and work on it?
- What security precautions will you need for this sensitive information?

One obvious limitation is that you cannot read more data into memory than available capacity allows. On a desktop or laptop computer you might have 4, 8, or 16 gigabytes of memory. On servers using TIBCO Spotfire Web Player or TIBCO Spotfire Automation Services, there may be 32, 64, or more gigabytes of available memory. Spotfire is capable of loading over 100 gigabytes of data, but it will take some time.
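A rough back-of-the-envelope check can help answer the "how much memory" question before loading anything. The sketch below is an assumption-laden estimate, not Spotfire's actual memory model: it assumes a flat 8 bytes per value and reserves half of the machine's memory as working headroom, and both figures are hypothetical defaults chosen for illustration. Real usage depends on data types, string lengths, and any compression the tool applies.

```python
def estimated_gigabytes(rows, columns, bytes_per_value=8):
    """Naive size estimate: rows x columns x bytes per value (assumed flat 8 bytes)."""
    return rows * columns * bytes_per_value / 1024**3

def fits_in_memory(rows, columns, available_gb, headroom=0.5):
    """True if the estimate fits in the share of memory left after headroom (assumed 50%)."""
    return estimated_gigabytes(rows, columns) <= available_gb * headroom

# Hypothetical workload: 200 million visit rows with 10 columns each.
rows, columns = 200_000_000, 10
print(round(estimated_gigabytes(rows, columns), 1))  # estimated size in gigabytes
print(fits_in_memory(rows, columns, available_gb=16))  # laptop-sized memory
print(fits_in_memory(rows, columns, available_gb=64))  # server-sized memory
```

Under these assumptions the example workload is too large for a 16-gigabyte laptop but fits on a 64-gigabyte server, which mirrors the desktop versus server distinction above.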
IN-DATASOURCE

In in-datasource mode, Spotfire leaves the data being analyzed in the source database or system. It then sends questions to that database or system and uses the results to display data visualizations. This is analogous to our second method of accessing the medical records. Once again, let's revisit the previous pros and cons:

This method seems much more efficient. However:
- How many threads does the database engine have available to process your requests?
- How will performance degrade as the questions become more complex, such as requiring segmentation by age and city?
- What will happen if you ask the database to run a Bayesian inference?

Enterprise-level database systems are often capable of holding petabytes of data and quickly and efficiently running queries against them. However, not all databases are created equal, so it's a good idea to run some test queries before committing to this mode of operation.

Another approach that falls into this mode of data retrieval is to pre-create a set of query answers on a regular basis so that the most common questions can be answered quickly and efficiently. The most common implementation of this approach is called a cube. Cubes are pre-built with specific dimensions and measures that contain pre-built answers to common questions. Spotfire is able to use cubes to visualize data, provided the dimensions and measures match what you want to visualize.

Spotfire comes with connectors for a wide range of data sources, which allows it to work in-datasource. Connectors are, in most cases, enabled when the corresponding driver has been installed. The driver is needed for TIBCO Spotfire Analyst on PC and for TIBCO Spotfire Consumer and TIBCO Spotfire Business Author on the Spotfire Web Player Server.

As we have seen, the ability to do calculations depends on the data source doing the job.
Relational data sources have differing capabilities, and cubes, with their pre-calculated measures, are in many ways very different from relational data sources. Spotfire accommodates all these different technologies.

DATA-ON-DEMAND

In addition to the two options already mentioned, Spotfire offers a third option for working with data, a hybrid of the other two. In data-on-demand mode, data is retrieved from the source system only when it is needed.

Consider our medical records example. Suppose your initial dashboard visualizations are set up to display the average number of doctor's visits using either in-memory or in-datasource data access. You identify a number of outliers, and now you want to drill into the actual visit notes to see if you can explain your observations. You don't want to bring the text history of all the patients into memory or retrieve it through an in-datasource query, because that would mean a huge amount of extra, irrelevant data. You only want to drill into the outliers. So, you retrieve those extra pieces of data on demand, when you need them. When you're finished reading them, you discard them to free up memory.

This is data-on-demand as a master-detail pattern. It is the most common pattern, but another useful pattern is to ask some questions first and load the data only once all the relevant information has been entered. For example, you could determine which medical procedures you want to see data for, and then retrieve only those specific records.
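The master-detail pattern described above can be sketched as follows. This is an illustrative stand-in, not Spotfire's implementation: SQLite plays the source system, the `visits` table and its rows are invented, and the outlier rule (more than twice the mean visit count) is an arbitrary assumption made for the example.

```python
import sqlite3

# Hypothetical source system with one row per doctor visit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "checkup"), (2, "checkup"), (3, "checkup"),
        (4, "fracture"), (4, "follow-up"), (4, "follow-up"),
        (4, "physio"), (4, "physio"), (4, "physio"),
    ],
)

def visit_counts(conn):
    """Master view: one small aggregated row per patient, computed in the source."""
    rows = conn.execute(
        "SELECT patient_id, COUNT(*) FROM visits GROUP BY patient_id"
    )
    return dict(rows)

def find_outliers(counts, factor=2.0):
    """Flag patients whose visit count exceeds `factor` times the mean (assumed rule)."""
    mean = sum(counts.values()) / len(counts)
    return [pid for pid, n in counts.items() if n > factor * mean]

def notes_on_demand(conn, patient_ids):
    """Detail view: fetch visit notes only for the selected patients."""
    if not patient_ids:
        return []
    placeholders = ",".join("?" for _ in patient_ids)
    query = (
        "SELECT patient_id, note FROM visits "
        f"WHERE patient_id IN ({placeholders}) ORDER BY rowid"
    )
    return conn.execute(query, patient_ids).fetchall()

counts = visit_counts(conn)               # small summary, always available
outliers = find_outliers(counts)          # the user's selection
details = notes_on_demand(conn, outliers) # loaded only for that selection
print(outliers, len(details))
```

Only the notes for the selected patients ever leave the source; once the analyst moves on, the `details` list can simply be dropped and fetched again for a different selection.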
DATA SOURCES

The data landscape is changing rapidly. Hadoop was a groundbreaker a few years ago and continues to reinvent itself. Other technologies, such as Hive on Tez, are continuously improved. Spark SQL offers fast direct query access and compatibility with Spark. Cloudera Impala is still one of the most popular ways to get fast answers on Hadoop. Vendors such as AtScale are tackling the demand for faster answers with cubes on Hadoop. Presto, the SQL query engine that Facebook contributed to the Hadoop community, has quickly been adopted by other large companies such as Netflix, LinkedIn, and Airbnb, and is now also heavily supported by Teradata.

We also see that traditional database vendors such as Teradata are extending their offerings with Hadoop support. For example, vendors are offering connectors that allow users to query and easily move data between their database and Hadoop.

Last but not least, data is moving to the cloud. Many new companies wouldn't even consider managing their own data storage platforms and are instead choosing offerings like Google BigQuery, Amazon Redshift, and Databricks Cloud. We also see that companies enabled by secure integration between on-premises and cloud data are moving certain use cases to the cloud, such as exposing parts of their data to partners.

[Figure: Three Keys to Data Visualization and Discovery on Big Data. Panel labels: In-Memory, Advanced Analytics, In-Datasource, Datasource, On-Demand. (1) In-memory architecture moves row-level data to the analytics platform. (2) Analysis of data in a database returns aggregation stats. (3) On-demand capabilities dynamically swap data in and out of memory based on the user's selection.]

CONCLUSION

In summary, the key to visual data discovery on big data is to access data in a combination of different ways at the same time, from the same analysis or dashboard.
But even with faster and more optimized databases and query engines, there is still a lot of data to analyze, and business users want and need fast analyses. Connecting to big data sources with the right combination of in-memory, in-datasource, and on-demand techniques is key.

With Spotfire, besides visualizing big data, advanced analytics, such as statistical models, are also possible. The statistical engine, TIBCO Enterprise Runtime for R (TERR), enables this. The key to advanced analytics is, again, the combination of in-datasource and in-memory techniques. Often, relevant data is analyzed and computed in-datasource and then brought into Spotfire and enriched with the powerful in-memory expression capabilities of TERR. For more information, see the references below.
REFERENCES

For specific Spotfire connector requirements: http://spotfire.tibco.com/solutions/technology/big-data

For additional details and systems requirements for any of the Spotfire data access capabilities, visit the TIBCO Spotfire Systems Requirements page: http://support.spotfire.com/sr.asp

Global Headquarters
3307 Hillview Avenue
Palo Alto, CA 94304
+1 650-846-1000 TEL
+1 800-420-8450
+1 650-846-1005 FAX
www.tibco.com

TIBCO Software empowers executives, developers, and business users with Fast Data solutions that make the right data available in real time for faster answers, better decisions, and smarter action. Over the past 15 years, thousands of businesses across the globe have relied on TIBCO technology to integrate their applications and ecosystems, analyze their data, and create real-time solutions. Learn how TIBCO turns data, big or small, into differentiation at www.tibco.com.

© 2015, TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, TIBCO Software, and Spotfire are trademarks or registered trademarks of TIBCO Software Inc. or its subsidiaries in the United States and/or other countries. All other product and company names and marks in this document are the property of their respective owners and mentioned for identification purposes only.

11/20/15