Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data
Ryan Hafen, Karin Rodland, Luke Gosink, Kerstin Kleese-Van Dam, Jason McDermott, William S. Cleveland
Purdue University

Figure 1: Screenshot of the Trelliscope viewer showing a display created during analysis of high intensity physics data. In the configuration shown, 6 panels are displayed per screen, which are sorted and filtered according to the parameters specified in the cognostics pane.

ABSTRACT

Trelliscope emanates from the Trellis Display framework for visualization and the Divide and Recombine (D&R) approach to analyzing large complex data. In Trellis, the data are broken up into subsets, a visualization method is applied to each subset, and the display result is an array of panels, one per subset. This is a powerful framework for visualization of data, both small and large. In D&R, the data are broken up into subsets, and any analytic method from statistics and machine learning is applied to each subset independently. Then the outputs are recombined. This provides not only a powerful framework for analysis, but also feasible and practical computations using distributed computational facilities. It enables deep analysis of the data: study of both data summaries and the detailed data at their finest granularity. This is critical to full understanding of the data. It also enables the analyst to program using an interactive high-level language for data analysis such as R, which allows the analyst to focus more on the data and less on code. In this paper we introduce Trelliscope, a system that scales Trellis to large complex data. It provides a way to create displays with a very large number of panels and an interactive viewer that allows the analyst to sort, filter, and sample the panels in a meaningful way.
We discuss the underlying principles, design, and scalable architecture of Trelliscope, and illustrate its use on three analysis projects in proteomics, high intensity physics, and power systems engineering.

Index Terms: G.3 [Mathematics of Computing]: Probability and Statistics, Statistical Computing; H.2.m [Database Management]: Database Applications, Data mining

1 INTRODUCTION

The Trellis visualization framework [2] provides an effective approach to detailed visualization of complex data. It has been widely used for data analysis as a way to meaningfully visualize both summary and detailed data. Currently, the most used implementation is the R package lattice [17]. In Trellis Display, the data are divided into subsets based on conditioning variables, and a plotting method is applied to each subset. The resulting plot for each subset is displayed on a panel. Current Trellis implementations render the panels on a single display that takes the form of a three-dimensional array with rows, columns, and pages. The whole process is controlled by a number of display methods that can be simply specified by the user. Implementations have provided a programming capability using the graphics language of the system that allows a user to specify almost any visualization method as the panel method. All of this enables a user to uncover the structure of data even when the structure is quite complicated. Trellis Display is discussed in Section 2.

Like all other analytic methods for data analysis, Trellis Display becomes difficult to use effectively for large complex data. There are two problems: (1) how to efficiently generate and organize a large number of panels; and (2) how to view and interact with many panels. Trelliscope is an extension of Trellis that solves these problems. The extensions are described in Section 3. Problem (1) is solved by employing the Divide and Recombine (D&R) approach to large complex data, which is discussed in Section 3.1.
Problem (2) is solved by making methods of sampling subsets a part of Trelliscope. We describe sampling and the use of cognostics in Section 3.2. Then, Section 4 introduces the design and implementation of Trelliscope. Section 5 illustrates its usage in three applications.

2 TRELLIS DISPLAY

Divide and Recombine (D&R), Trellis Display, and Trelliscope all have as a fundamental aspect a division of data into subsets, and then a recombination. We begin with an illustration of division and Trellis Display with a very small dataset. Figure 2 is a Trellis Display of data from a 1930s experiment at the Minnesota Agricultural Experiment Station. At 6 sites, 10 varieties of barley were grown in each of 2 years. The data collected for the experiment are the yields for all combinations of the three factors site, variety, and year, so there are 120 observations. In the figure, there are two conditioning variables: site and year. Site has 6 values and year has 2, so there are 12 combinations of site and year. The data are divided into 12 subsets by the 12 combinations. The visualization analytic method applied to each subset is a dotplot. Each panel of the display is this dotplot for one subset. This Trellis display allows us to study the dependence of yield on variety conditional on site and year, and how the relationship changes with the values of the conditioning variables. This turns out to be an exceptionally powerful mechanism for analysis of data, both small and large. Often, a number of Trellis displays are made with different conditioning variables; for example, we can plot yield against site given variety and year.

The barley data demonstrate the power of D&R. They were published by R. A. Fisher in his book, Statistical Methods for Research Workers, and through the decades became a canonical dataset used to illustrate many new numerical statistical methods. Despite the many re-analyses through time, it was not until the use of Trellis Display in their analysis 60 years later that a major error in the reported data was discovered [6]. This display gives a first clue, an anomaly at Morris.
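The division step illustrated by the barley example can be sketched in plain Python; the records and yield values below are made up for illustration and are not the actual barley measurements:

```python
from collections import defaultdict

# Toy barley-style records: (site, year, variety, yield); values invented.
records = [
    ("Morris", 1931, "Velvet", 26.7),
    ("Morris", 1932, "Velvet", 23.1),
    ("Waseca", 1931, "Trebi", 54.2),
    ("Waseca", 1932, "Trebi", 38.5),
]

def divide(records):
    """Divide the data into one subset per combination of the
    conditioning variables (site, year)."""
    subsets = defaultdict(list)
    for site, year, variety, yld in records:
        subsets[(site, year)].append((variety, yld))
    return dict(subsets)

subsets = divide(records)  # a panel method is then applied to each subset
```

In Trellis terms, a dotplot panel function would then be applied to each of the 12 subsets of the real data, one panel per (site, year) combination.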
Figure 2: A Trellis display of the barley data. Each panel is a dotplot of yield against the ten varieties (Trebi, Wisconsin No. 38, No. 457, Glabron, Peatland, Velvet, No. 475, Manchuria, No. 462, Svansota) for one combination of site (Waseca, Crookston, Morris, Univ. Farm, Duluth, Grand Rapids) and year.

There are many visualization methods that are employed to enhance the effectiveness of Trellis Display. While these choices are largely up to the user to specify, a good Trellis implementation will provide mechanisms to specify and explore these choices with ease. More details on these and other considerations can be found in [11] and [2].

3 EXTENDING TRELLIS

Trelliscope extends Trellis Display to large complex data, enabling the efficient generation of displays with a very large number of panels that can be easily and effectively viewed. To achieve this we use the Divide and Recombine (D&R) approach to large complex data to generate the displays, and sampling of subsets to make the number of panels manageable.

3.1 Divide and Recombine

Introduction

Divide and Recombine (D&R) is a statistical approach to the analysis of large complex data [10]. The data are divided into subsets. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. This provides computational feasibility and practicality.

There are two categories of analytic methods: statistical methods (including machine learning methods), whose output is numeric and categorical, and visualization methods, whose output is visual. For a visualization method, the recombination is a visual display that assembles the panels for viewing across subsets. For a statistical analytic method, the recombination results in numeric and categorical values. For example, suppose we carry out logistic linear regression on subsets.
The outputs are the estimates of the regression coefficients and statistical information about the estimates. The recombination can be unweighted means of the subset coefficient estimates, or means weighted by estimates of their variances. The D&R division and recombination methods used for an analytic method are critical to the success of the D&R result. We seek the best division and recombination methods that suit the analysis task at hand. In many cases, we are interested in studying the behavior of individual subsets of the data partitioned by categorical variables. An example is the visualization of the barley data, where each subset is determined by the year and site. Divisions can also be chosen such that each subset is a random partition of the data, where each subset can be thought of as an experiment. With such divisions, in many cases we can obtain analytic recombination results that serve as approximations that are very close to, or exactly the same as, what we could have gotten had we been able to apply the analytic method to all of the data directly.

Two Goals

One goal for D&R is deep analysis of large complex data: detailed, comprehensive analysis that does not lose important information in the data. Deep analysis requires visualization of the data at their finest granularity, not just visualization of summaries. This was established in the 1960s by the pioneers of data visualization for model building such as John Tukey, Frank Anscombe, Cuthbert Daniel, and George Box. We can do this with small data. For large complex data, should we wave the white flag, surrendering to the size and complexity, and visualize just summaries? Would that not be a step backward? Large complex data should not get a pass.

A second goal is for the analyst to be able to use a well-designed interactive language for data analysis. This can substantially reduce the time an analyst spends programming with the data compared with using a lower-level language.
There is a very clear metric: the fraction of the time the analyst spends thinking about programming versus the fraction of time spent thinking about the data. We seek to minimize the former. The language also needs to be powerful, allowing a tailoring of the analysis to the data. The language needs to provide access to the thousands of statistical and analytic methods that exist today. R [15] does a very good job on analyst efficiency, programming power, and access to methods.

The Computational Environment

What makes D&R work is that the computations for the application of an analytic method to the subsets can be run in parallel with no communication among them. D&R can exploit parallel distributed computational environments like Hadoop, with its MapReduce compute engine and the Hadoop distributed file system. D&R makes it feasible for a data analyst to apply almost any statistical or visualization method to large complex data. Of course, the data analyst does not want to program in Hadoop's native language, Java. RHIPE, the R and Hadoop Integrated Programming Environment, enables the analyst to write Hadoop
MapReduce jobs completely in R [9]. RHIPE has two parts. The first is a core, written in R and Java, that communicates with Hadoop. The second is a prototype domain-specific language for D&R in R, available as a package named datadr [13]. This software provides a simple interface for specifying division and recombination methods. The datadr methods provide a layer on top of RHIPE, allowing the analyst to think only in terms of meaningful division and recombination methods, hiding the details of MapReduce from the user. Also, datadr has been designed with the goal in mind of linking to other computing and storage backends. Trelliscope has been implemented in the D&R computational environment of R and datadr. There are a number of potential backends. Thus far we have used RHIPE/Hadoop, which provides for scaling to large complex data, efficiently creating Trelliscope displays with many panels.

3.2 Sampling Panels with Cognostics

A natural approach to viewing many panels is simply scrolling through all pages of panels tiled across a screen. This alone can be very powerful, but can become unrealistic when the number of panels is large. In our own applications, the number of subsets ranges from thousands to millions. Our approach to viewing a large number of panels in a meaningful way is to compute metrics on the data being plotted in each panel that identify interesting qualitative or quantitative attributes, and then sample, filter, or sort the panels based on these metrics. Here, we loosely refer to any such metric as a cognostic, a Tukey term for computer-guiding diagnostics [5].

What constitutes an interesting attribute of a subset? In most cases, this is left up to the analyst to determine for the visualization at hand. A cognostic metric is simply the result of any computation the analyst can think of that exploits some behavior in the data being visualized.
Examples are simple statistical metrics such as the mean, standard deviation, quantiles, range, and number of observations. If a model fit is being plotted, a goodness-of-fit metric for the model fit for the subset can be a meaningful cognostic. Other examples are the percentage of values exceeding some limit, the number of times a specific value is seen, the difference between means of two groups of data, etc. There is no specific set of steps for specifying these types of cognostics. Finding meaningful cognostics is typically an iterative process. The analyst creates a panel plotting function, applies it to a few subsets, studies the resulting plots, and determines a candidate set of cognostics that might provide a meaningful space for navigating the collection of all panels. Then the analyst applies the panel and cognostics functions to all subsets and views the results.

Cognostics can also come from work in the interesting related research area of scatterplot diagnostics, or scagnostics [21]. Scagnostics are cognostics specifically for scatterplots in which algorithms provide metrics for specific types of interesting behavior, such as outlying, skewed, clumpy, striated, etc. [21]. These cognostic metrics have more of a flavor of the computer trying to tell the analyst what is interesting in a scatterplot (probably more true to what Tukey had in mind), as opposed to the analyst-defined cognostics described previously.

Cognostics are specified and attached to a Trelliscope display within the programming environment, as they require specification of potentially complex algorithms to return the desired metrics. The analyst provides a function that takes a subset of data as input and returns a list of cognostic attributes. An example of how a display is created along with the cognostics is shown in Section 4. There are different ways cognostics can be used to interact with the panels of a Trelliscope display.
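As a sketch, an analyst-defined cognostics function of the kind described above might look like this in Python (the limit of 100.0 is a hypothetical threshold and the metric names are illustrative; in Trelliscope itself this would be an R function):

```python
import statistics

def cognostics(values, limit=100.0):
    """Compute example cognostics for one subset: simple summaries plus
    the percentage of values exceeding a limit of interest."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
        "range": max(values) - min(values),
        "pct_over_limit": 100.0 * sum(v > limit for v in values) / len(values),
    }
```

Applying such a function to every subset yields one row of metrics per panel, which is exactly the table the viewer later sorts and filters on.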
For very exploratory views of the display, a panel viewer can present the set of cognostics to the analyst and allow them to sort and filter the panels based on the different cognostic attributes. This results in a focused sampling of panels based on different metrics of interest. Our implementation of such a Trelliscope viewer is described in Section 4.

In cases where we would like to conduct a more controlled study of the panels of a large display, we can take representative samples of panels based on distributions of cognostics. A representative sample covers the joint region of values of a set of cognostic variables. In most cases the variables are known to have an effect, and we seek to study their effect within the region of the multidimensional space that they collectively occupy. For example, suppose 10^6 subsets have variable sample sizes n. Suppose from each subset we get an estimate of a parameter, and 5000 bootstrap values of the estimate to describe the sampling distribution of the estimator for the subset. Suppose we want to study the normality of the bootstrap distribution for each subset by a normal quantile plot, and how it changes with n. Of course our expectation is that the estimator gets closer to normal as n increases. We could check normality by making normal quantile plots for 1000 subsets whose values of n on the log scale are as close to equally spaced as possible. Such representative sampling could be specified either in a Trelliscope viewer or from the analysis programming environment.

4 DESIGN AND IMPLEMENTATION OF TRELLISCOPE

The Trellis extension concepts introduced in the previous section are quite straightforward: the idea is to use a distributed computing interface to create the panels and leverage cognostics and sampling to interactively view the panels.
However, inserting these ideas into an implementation that provides meaningful interactivity and an efficient workflow for a data analyst, while being interoperable with a distributed storage and computing backend, is not trivial. We have created an R implementation of the Trellis extensions, called Trelliscope. As a system, Trelliscope builds upon D&R by providing methods to apply panel and cognostic functions to divided datasets to create displays with many panels, and by providing a web-based viewer that allows users to interact with the panels, all with minimal effort. For example, users can sort and filter millions of panels using a wide variety of cognostics values. The filtered panels are tiled on a screen where the user can then rapidly page through them.

To support this at-scale flexibility for visualization, Trelliscope takes as input a dataset divided into subsets using the divide() method from the datadr D&R package [13]. This data can either be a divided local data frame or a divided dataset on the Hadoop distributed file system (HDFS). A prepanel function can be applied to the data to determine axis limits for each panel. If limits are not specified, scales are chosen such that the data fill the panel. The analyst then specifies a plot function and a cognostics function to be applied to each subset and sends these to a command makedisplay(). This function applies the panel and cognostics functions to each subset and stores the results. Trelliscope employs R as a visualization system to create the plots and compute the cognostics. By using the R-based interfaces of base R graphics, lattice, or ggplot [20, 17], it is possible to create nearly any visual representation that a user requires. There are several options for how plots and cognostics are stored, which are discussed in greater detail below.

Hardware Architecture

Trelliscope can work on a single workstation operating on in-memory data frames in R.
When dealing with very large data sets, Trelliscope can interface with scalable, distributed hardware architectures. For example, Figure 3 shows how Trelliscope can operate seamlessly with very large datasets stored on HDFS. Our analysis environment has been set up in a way similar to that of Figure 3, except that the web server and cognostics server are both running on the master node of the Hadoop cluster.
Figure 3: Distributed architecture of Trelliscope. Trelliscope is capable of running on a single workstation, but has the ability to run in a distributed architecture, one possibility of which is shown here. All analysis work is done on the analyst's workstation in R, and the D&R interface and Trelliscope R package coordinate communication with the other services. Other users, such as subject-matter experts (SMEs), can access the viewer through a web browser.

Panel Rendering

We have investigated and implemented several options for panel rendering and storage. There are two major classes of panel rendering possibilities: (1) prerender and store raster images on disk, HDFS, or MongoDB; (2) render panel images on the fly. In the latter case, pointers to each subset on disk are stored in the image's place and the images are rendered on the fly by reading the associated data when requested by the viewer. The advantage of rendering on the fly is much quicker initial creation of a display (only cognostics are computed), as well as the ability to dynamically modify the panel function from within the viewer. This particularly makes sense when there are millions of subsets and the probability of viewing all subsets is very low, in which case there is not much sense in precomputing all of them. It typically is not expensive in terms of compute time or storage space to compute cognostics for every subset, even when the number of subsets is in the millions. The only possible downside to the render-on-the-fly approach is the case when the plot function is very expensive to apply. In that case, it can be worth the time expense to pre-generate the plots so that time isn't spent on rendering during the viewing process.

In our implementation, the viewer knows how to link to a panel image or data through a special cognostic automatically computed for each display called panelkey. Panel images or data can be uniquely located through the combination of the display name and panel key. Panel data or images stored on HDFS are stored as key/value pair Hadoop mapfiles [19], with the key being the panel key. The images or data can then be rapidly retrieved by key when the viewer requests them. For panel images stored on disk, the files are named according to panel key and are retrievable by name.

Cognostics Storage

Cognostics for a display can be stored either as an R data frame on disk or as a collection in a database. The R data frame storage approach is the simplest and does not require additional hardware or software to be available aside from R on a workstation. The potential drawback to this approach is that it does not scale well beyond a few million subsets, as the viewer must load the entire cognostics data frame into memory to show the display. Storing the cognostics in a database is another option that provides several benefits over a data frame approach. Most databases will provide the ability to index columns of the database for fast filtering, while not being required to be loaded into memory on the web server. We have chosen MongoDB as a database option for storing cognostics in our system. MongoDB scales very easily and has a flexible aggregation framework that we plan to leverage to perform different viewer-directed transformations and aggregations of cognostics.

4.2 Creating a Display in R

Creating a display from an R instance is as simple as calling the makedisplay() function and specifying (1) a divided input dataset, (2) a name for the display, and (3) a plot function for the panels. Other optional settings (e.g., display group, axis limits, and a cognostics function) can be specified as well. With respect to data types, makedisplay() can handle different classes of divided data objects (e.g., local data frame or HDFS), and there are different parameter defaults for how plots and cognostics are stored based on the input type. Detailed documentation about all of the options with examples is provided in the online documentation [14].
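The indexing benefit described under Cognostics Storage can be sketched with a sorted column index and binary search; the toy cognostics rows below are invented, and this is only an illustration of the idea, not MongoDB's actual mechanism:

```python
import bisect

# Toy cognostics table: one row per panel (panelkey, n, mean); values invented.
cogs = [
    ("p1", 120, 35.2),
    ("p2", 80, 31.7),
    ("p3", 200, 29.4),
    ("p4", 150, 33.0),
]

# Build a sorted index on the n column, as a database index would,
# so a range filter need not scan every row.
index = sorted((n, key) for key, n, _ in cogs)
values = [v for v, _ in index]

def filter_n_range(lo, hi):
    """Return panel keys whose n lies in [lo, hi], found by binary search."""
    i = bisect.bisect_left(values, lo)
    j = bisect.bisect_right(values, hi)
    return sorted(key for _, key in index[i:j])
```

With an index like this, a viewer range filter touches only the matching rows rather than the full cognostics table, which is what keeps filtering fast without loading everything into memory.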
An example of creating a display could look like this:

    makedisplay(
      data = gridcounts,
      group = "exploratory",
      name = "gridcounts",
      desc = "Number of grid hits over time",
      lims = list(x = "free", y = "free"),
      plotfn = gcplot,
      cogfn = gccog,
      plotdim = list(height = 200, width = 800),
      storage = "hdfsdata"
    )

Here, gridcounts is a divided HDFS data object. We have created and tested a plot function gcplot and a cognostics function gccog that we specify to be applied to each subset of this data. We specify the x and y axis limits to be "free". The storage option is "hdfsdata", meaning that only cognostics will be computed, including a panelkey that will be used to fetch data and render on the fly in the viewer.

Prior to calling makedisplay(), there are functions that aid the analyst in specifying the panel axis limits. A prepanel() function computes the axis limits for each subset and returns them. The analyst can visualize the limits with a special R plot method to get a feel for the distribution of the axis limit values. This can be helpful in discovering subsets with large outlying values, which are sure to abound when the data are very large, and the analyst can specify to trim outliers so that they do not inflate the axis limits of all panels.

4.3 The Trelliscope Viewer

The Trelliscope viewer is a web application written in R using the R Shiny [16] package. The server for the viewer can either run on the user's workstation or on a web server. The viewer is accessed through a web browser. Shiny provides communication from the web browser to a live R session on the web server through a WebSocket connection, providing an interactive user interface controlled by JavaScript.

Panel View

The panel view is the main component of the viewer; Figure 4 shows the panel view for a display created during the analysis of high intensity physics data where the user has specified to view six panels per screen.
Once the user has selected a display to view, the panel view tiles the panels of the chosen display across the screen. Left and right keys or the navigation buttons on the header bar provide simple navigation through the collection of panels. By clicking the view button, the analyst can specify how many panels are to be viewed in a single screen and can also specify
whether cognostic values should be displayed alongside each panel. There is also the view option to edit the panel plotting function for displays that are not pre-rendered. This ability to dynamically change what is being plotted is an additional benefit of not prerendering the plots. The user can react to what is being seen in the panels directly through the viewer (e.g., adding a reference line) without going back to R to recreate the entire display. An example screenshot of the view pane is shown in Figure 5.

Figure 4: Screenshot of the panel view of the Trelliscope viewer for a display created during the analysis of high intensity physics data.

Figure 5: Screenshot of the view pane of the Trelliscope viewer. The panel plot editing tab is visible.

Cognostics Interaction

A major feature of the Trelliscope viewer is the ability to sort and filter panels according to the cognostics. Clicking the Cog button brings up an interactive table that displays the cognostics variables. Clicking column headers in the cognostics panel allows for sorting, with multi-column sorting available with shift-click. In addition to sorting, there are filters available for each variable. Regular expressions can be used to filter qualitative variables and numerical ranges can be used to filter quantitative variables. A screenshot of the cognostics pane for a display is shown in Figure 6. This figure depicts an example of filtering and sorting where operations done in the cognostics pane determine how the panels will be displayed back in the panel view.

Figure 6: Screenshot of the cognostics pane in the Trelliscope viewer. In addition to panelkey, 4 cognostics variables are visible. The time cognostic has been range filtered and the filtered cognostics are sorted in reverse order by the cognostic variable n.

The cognostics pane also allows for interactive visual filtering of cognostics for quantitative variables. Visual filtering is useful in that it can help the user identify regions of interest in the cognostics space using the distribution of the cognostic as a visual cue. The visual filters are created using d3.js [3]. For univariate filtering, the user can click the histograms at the bottom of the cognostics table (see Figure 6). This activates the histogram so that users can select a range of the histogram's values; a screenshot of this type of selection is shown in Figure 7.

Figure 7: Screenshot of the interactive univariate histogram cognostic filter in the Trelliscope viewer.

Multivariate filtering is also possible. In the cognostics pane's Multi Filter section, the user can select two variables and bring up an interactive scatterplot from which a bivariate range of values can be identified to filter on. If there are more than 5,000 cognostics, a hexagonal binning [4] is computed and plotted instead of individual points. In the future, the viewer will support more than two variables by using projection pursuit to project the multivariate data onto a two-dimensional plane which the user can then interact with. An example of bivariate filtering is shown in Figure 8.

Viewing related displays

It is very common to create multiple displays for the same division of the data. When this is the case, it is useful to view multiple
displays side-by-side for each subset. Trelliscope provides a way to do this by keeping track of which displays were created from the same division. By selecting the Displays button and selecting Related Displays from the menu, the user can select any number of related displays to be shown in the viewer.

Figure 8: Screenshot of the interactive bivariate cognostic filter in the Trelliscope viewer.

5 APPLICATIONS

We have employed the principles behind Trelliscope in many projects [12, 22]. Here, we discuss three recent projects that have made direct use of the Trelliscope implementation. The main goal of this paper is to introduce the Trelliscope methodology and implementation, and therefore we are not able to go into detail for these applications. Our aim is to briefly illustrate the use of Trelliscope in real-world data analysis projects and demonstrate a few use cases where it was an integral part of deep analysis.

5.1 Power Systems Data Cleaning and Event Detection

Sensors called Phasor Measurement Units (PMUs) provide power grid operators with the status of the electric power grid at various geographical locations at a very high frequency of many times per second. Our analysis covered 1.5 years of data from 38 PMUs, a data set 2 TB in size. The data consist of about 1.5 billion time points with measurements from 38 PMU locations, where each PMU records on average 14 measurements. These measurements include the measured frequency at which the grid is operating as well as several phase angle measurements. The initial goal for our analysis was to build event detection algorithms for specific grid activities of interest such as generator trips and grid islanding. Our initial inquiry for the data was simply to understand the behavior of each series over time.
As all behaviors of interest are time-local, we chose a subsetting scheme of time intervals that are short enough to be able to process and visualize at high enough time resolution, but long enough to ensure that we capture the time-local behavior within subsets. Based on these requirements, we settled on 5-minute time intervals for subsetting. There are about 158,000 subsets of the data, and when additionally splitting out by location, there are about 6 million subsets. We investigated the data using Trelliscope to create and view time series plots of the subsets. Trelliscope provided a way for us to have immediately at our fingertips a detailed view of any time sequence of the 2 TB dataset. Figure 9 shows two panels of one of the time series displays created. The frequency of the grid across a 5-minute segment of time is shown, with a series for each of the 38 PMUs overplotted. The frequency reported for each PMU generally follows the same trend over time.

We iteratively developed a collection of useful cognostics for this display. Simple attributes that provide for temporal sorting and filtering include the start time of the time interval and derived attributes such as time of day, day of week, etc. Other simple attributes include the number of missing values and the number of times the frequency exceeds a limit specified by power systems engineers to be alarming or physically impossible. The time series vary around a trend that changes over time. To capture that variability, we specified a cognostic that removes the trend by fitting a local polynomial regression to the time series and computes a robust measure of the standard deviation of the residual values of this fit. This cognostic helps point us to subsets of the data where the variability around the trend is unusually large or small. Another useful cognostic we computed was the steepest slope of the fitted trend across the 5-minute interval, which helped us investigate time periods of extreme changes in frequency.
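A simplified version of the detrending cognostic just described can be sketched in Python, substituting a centered moving average for the local polynomial regression and the median absolute deviation (MAD) for the robust spread measure; the window size is an assumption for illustration:

```python
import statistics

def detrend_variability(series, window=5):
    """Cognostic sketch: remove a slowly varying trend with a centered
    moving average, then summarize the residual spread robustly with the
    median absolute deviation (MAD)."""
    n = len(series)
    half = window // 2
    residuals = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend = sum(series[lo:hi]) / (hi - lo)  # local mean as the trend
        residuals.append(series[i] - trend)
    med = statistics.median(residuals)
    return statistics.median(abs(r - med) for r in residuals)
```

Sorting panels on a metric like this surfaces the 5-minute intervals whose variability around the trend is unusually large or small.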
Sorting and filtering on the cognostic variables would typically bring to our attention subsets where other interesting behavior was occurring, behavior not captured by, but correlated with, the cognostics. Some of these behaviors were so rare that it is doubtful they would have been discovered without the ability to interactively look at the data at the greatest level of detail. A large proportion of the behaviors we uncovered were deemed to be erroneous records. The combination of Trelliscope for detailed views with the analytical recombination methods of D&R enabled us to identify, validate, and build algorithms to remove these records from the data so that we could focus on event detection. Many of these types of erroneous records had gone unnoticed in several prior analyses where tools like D&R and Trelliscope were not available. Another notable use case of Trelliscope was in fine-tuning our generator trip algorithm. Based on examples of known trips, we were able to identify the characteristics of a generator trip: all frequency measurements experience a sharp, sudden drop followed by a gradual recovery. We constructed features from the data that captured these types of behaviors, but we were not certain at which point the combination of feature metrics constituted a bona fide generator trip. We used the metrics to procure a candidate set of over 3000 time periods in which we thought a generator trip might have occurred. These data constituted a new division, with each subset containing the data corresponding to a suspected generator trip. We created a Trelliscope time series display of the data and presented it to a domain expert to view and classify using the Trelliscope viewer. Classification narrowed the list down to about 500 verified trips, and from this narrowed list we were able to create a rules-based algorithm for detecting generator trips. A screenshot of the Trelliscope viewer being used for this purpose is shown in Figure 9. 
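As a sketch of the kind of screening rule this process produced, the following Python function flags a frequency series as a candidate trip when it contains a sharp one-second drop followed by substantial recovery by the end of the interval. The sampling rate, thresholds, and feature definitions here are hypothetical stand-ins, not the validated rules from this analysis.

```python
import numpy as np

def is_candidate_trip(freq, rate_hz=30, drop_mhz=20.0):
    # freq: frequency samples (Hz) for one time interval.
    step = rate_hz                             # samples in one second
    diffs = freq[step:] - freq[:-step]         # one-second frequency changes
    i = int(np.argmin(diffs))                  # location of the sharpest drop
    sharp_drop = -diffs[i] * 1000.0            # drop magnitude in mHz
    j = min(i + step, len(freq) - 1)
    recovered = (freq[-1] - freq[j]) * 1000.0  # net recovery after the drop, mHz
    # candidate: a large sudden drop followed by meaningful recovery
    return sharp_drop > drop_mhz and recovered > 0.25 * sharp_drop
```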
Figure 9: Trelliscope viewer being used to view and verify generator trips in a power systems engineering application.

5.2 High Intensity Physics Particle Classification

In a high intensity physics application, we are looking at detecting and differentiating between two sub-atomic particles, the pion and the kaon, that are present during a single 100ns event from simulated data that models a detector in the Belle II experiment [8]. Within a single event, the numbers of these particles, their ratio to each other, and the energy distribution of each particle type are key to identifying inconsistencies in the Standard Model of particle physics. This is an ongoing project, but our initial task, in anticipation of real data becoming available in the coming years, is to determine whether kaon/pion classification from simulated data is feasible. We are able to stochastically simulate data from a variety of possible scenarios. From a high level, the scenario of interest begins when a kaon or pion particle hits the exterior of the detector and releases a small array of photons into one of the detector's quartz chambers. The particles themselves enter the detector at a known momentum, P, and propagate photons radially at an unknown angle, θ. The photons bounce around the detector until they hit a sensor grid of 64 × 8 channels (labeled 1, ..., 512). We have simulated hundreds of thousands of events from 168 combinations of P, θ, and particle type (kaon/pion). The simulation reports the hit time and location on the sensor. The complete realizations we are currently studying generate about 12GB of data, but we will soon be looking at much larger collections of events. This is not nearly as much data as in our previous example, but still enough that we cannot tackle it with traditional approaches. In our first step, we divide the data into subsets for each combination of P, θ, and particle type in order to study the distribution of hit times and locations on the sensor grid according to channel. This results in a manageable 168 subsets. One Trelliscope display we created on this division is a hexagonally binned scatterplot of time vs. channel. An example of this for pions with cos(θ)=0.1 and P ranging from 1 to 5 (left to right, top to bottom) is shown in Figure 4. 
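The division step here amounts to a group-by on the simulation parameters. A minimal Python sketch (the record field names are assumed for illustration):

```python
from collections import defaultdict

def divide(records, keys=("P", "theta", "ptype")):
    # Group simulated hit records into one subset per
    # (momentum, angle, particle type) combination.
    subsets = defaultdict(list)
    for rec in records:
        subsets[tuple(rec[k] for k in keys)].append(rec)
    return subsets
```

With 168 parameter combinations, this yields the 168 subsets to which the hexagonal-binning panel function is applied.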
Initially, cognostics for this display were simply specified to be the values of P, θ, and particle type that define the subset. We viewed the display in the Trelliscope viewer, choosing 2 panels per page and sorting the panels first by P, then θ, then particle type. This allowed us to view the kaon and pion scatterplots side by side for fixed P and θ, and then page through all 168 panels, making visual comparisons as we progressed through different levels of θ and then P. Viewing in this way, we noticed that for small P it is easy to visually distinguish kaons and pions, but as P increases (from P ≥ 3), the two begin to look very similar. To further investigate the kaon/pion behavior, we partitioned the data by time. More specifically, we chose one set of parameters, P=3 and cos(θ)=0.1, to study and contrast kaon/pion behavior in detail. An initial look at this case over time led us to desire a finer time resolution and hence more sample runs. We generated 1 million sample runs for this case and divided the data into 3739 subsets according to equal-width log time intervals. One display we created for these data is a heat map of the hit distribution within each time chunk for each of kaon and pion, plotted according to the actual grid layout of the sensor. An example of this display can be seen in the background of Figure 1. For this display, we computed cognostics for the start time of the interval, the number of kaon hits, the number of pion hits, and the difference between kaon and pion hits. Our first view of this display was simply quickly paging through the panels in time order. This view helped us notice that there are time periods in which a clear signal is present on the sensor, and that this signal moves around on the sensor over time. The activity comes in bursts, and the signal is only present when the activity is high, which we verified by looking at a scatterplot of the number-of-hits cognostic vs. time (see Figure 8). 
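The equal-width log time division can be sketched as follows; this Python version assigns each hit to one of a fixed number of intervals that are equally spaced on the log-time scale (the bin count and field handling are illustrative):

```python
import numpy as np

def log_time_bins(times, n_bins):
    # Equal-width intervals on the log-time scale.
    lo, hi = np.log(times.min()), np.log(times.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.searchsorted(edges, np.log(times), side="right") - 1
    # keep the maximum time in the last bin rather than dropping it
    return np.clip(idx, 0, n_bins - 1)
```

Each resulting bin index defines one subset, and hence one heat-map panel, in the time-divided display.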
Viewing this display also gave us a better view of the differences between kaon and pion, and we were able to see many time periods where the signal was visibly very different between the two. Based on our observations, we constructed a simple Bayes classifier for any new sample. The sample hits are matched to the time partitions of our time-divided data, and then, based on the distribution of the training data in that time period, we compute the probability that the point is a kaon or pion given its hit location on the grid. We found that if we weight these classifications toward the time regions where the difference between kaon and pion is most pronounced, we can achieve classification accuracy in the 90% range. While we have omitted several of the iterations and steps in this process, Trelliscope and the datadr package played a crucial role in carrying out this analysis effectively in a very short time.

5.3 Proteomics Biomarker Discovery

The goal of a proteomics research problem we have been involved with is to extract meaningful signatures of biological processes from peptide and protein abundance data. The process of biomarker discovery and characterization provides opportunities both for purely statistical and for expert knowledge-based approaches. In this project we investigated the use of Trelliscope in an effort to improve the integration of the two. We used proteomics data from a mouse model of chronic obstructive pulmonary disease (COPD) that involves deletion of a gene, ADA. Animals with ADA+/- are controls (no disease), while ADA-/- animals are diseased. We analyzed data from plasma samples of control and disease groups over time to study biomarkers for COPD. The abundance data underwent an extensive preprocessing procedure, resulting in data for 2,927 peptides, which were rolled up into 414 proteins. Statistical and functional clustering tools were applied to the data, forming potential groups of biomarkers. 
We created a visual representation of the data in Trelliscope to present to a domain expert. We divided the protein data into subsets by protein, and likewise for the peptide data. The panel function for each protein or peptide displays the abundance over time by disease and control groups. An example of two panels from the protein display is shown in Figure 10. The dots represent the abundances for each subject, while the crosshairs and lines denote the mean abundance at each time point for each group and its progression over time. The top and bottom rows of colored boxes indicate whether a statistically significant difference exists between the two groups at each time point, the top row for a G-test (a test for presence/absence) and the bottom row for a t-test for a difference between group means [18]. Green indicates a statistically significant positive difference between control and disease for the given test at the given time point, gray indicates no significant difference, and red indicates a statistically significant negative difference. These boxes provide a visual cue that can be rapidly digested while paging through many panels. We computed several cognostics for each panel. Some simple statistical cognostics derived from the data in each subset include the following: the number of total, ADA-/-, and ADA+/- observations in the panel; the average difference between ADA+/- and ADA-/- before and after day 34; t-test and G-test p-values on each day; and the slope of ADA-/- and ADA+/- abundance over time. We merged external domain-specific data with our subsets to obtain cognostics providing more information about each protein, including protein annotation and function, cluster assignment from hierarchical clustering, enriched functional cluster assignment and gene functional classification from DAVID [7], and assigned generalized biological process from GO [1]. Using these cognostics, data analysts and domain experts were able to interact with the panels in many ways. 
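For illustration, a per-panel cognostics function of the sort described above might look like the following Python sketch. The variable names, group labels, and the exact set of cognostics are assumptions rather than the actual implementation (which was written in R), and SciPy supplies the t-test.

```python
import numpy as np
from scipy import stats

def panel_cognostics(day, abundance, group):
    # Cognostics for one protein subset: group counts, mean group
    # difference, per-group abundance slope over time, and per-day
    # two-sample t-test p-values.
    g_pm = group == "ADA+/-"
    g_mm = group == "ADA-/-"
    cog = {
        "n_total": int(len(abundance)),
        "n_ada_pm": int(g_pm.sum()),
        "n_ada_mm": int(g_mm.sum()),
        "mean_diff": float(abundance[g_pm].mean() - abundance[g_mm].mean()),
        "slope_ada_pm": float(np.polyfit(day[g_pm], abundance[g_pm], 1)[0]),
        "slope_ada_mm": float(np.polyfit(day[g_mm], abundance[g_mm], 1)[0]),
    }
    for d in np.unique(day):
        a = abundance[g_pm & (day == d)]
        b = abundance[g_mm & (day == d)]
        if len(a) > 1 and len(b) > 1:
            cog[f"t_pvalue_day{d}"] = float(stats.ttest_ind(a, b).pvalue)
    return cog
```

A table of such cognostics, one row per protein panel, is what the viewer sorts and filters on.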
Panels were sorted and filtered on proteins of interest, protein functions of interest, etc., to help the domain expert understand what the data had to say. Panels were also sorted by statistical significance trends of interest, such as abundances that are not significantly different in the beginning but become significant over time. Another interaction was filtering the panels based on the clustering results to judge whether the resulting protein groupings made sense, and whether anything apparent in the data would explain and provide some interpretability of the protein groupings
selected.

Figure 10: Two panels from a display of protein abundance vs. time for disease (ADA-/-) and control (ADA+/-) groups. Each panel plots concentration against day and is headed by the protein name (here AZI1_MOUSE and DPOD1_MOUSE), with G-test and t-test significance boxes along the top and bottom.

This analysis facilitated by Trelliscope provided valuable insight into the data for the domain experts in several key respects. Visualization of the peptides (portions of the whole protein) contributing to the quantitation of key proteins was evaluated. In several cases it was apparent that there were two trends in the peptides mapping to one protein, indicating that the mapping was incorrect. Grouping of the proteins into functional groups was used to identify subgroups for incorporation into a classification algorithm, which allowed identification of improved biosignatures of disease in the COPD samples. The data for this example are by far the smallest (less than 10MB) and illustrate the utility of Trelliscope when the data are not large. Detailed views of the data still required panels for thousands of subsets and thus benefited from the interactivity of Trelliscope.

6 DISCUSSION AND FUTURE WORK

Trelliscope provides a framework for scalable, detailed, interactive visualization of large complex data. It has been developed out of necessity from the analysis of many real-world large complex datasets and has proven to be immensely useful. Trelliscope is a research effort that to this point has mainly focused on the architecture and practicality of a scalable Trellis Display system. In the future, we would like to give much more attention to the design and features of the viewer. The main focus of future work on the viewer will be sampling and cognostics. We envision an interface that allows the analyst to specify different sampling schemes for the panels. We also anticipate new cognostic types that provide new ways to interact with the data. 
One of these is relational cognostics, where the cognostic for one panel is a reference to another panel or panels, either in the same display or in other displays. Such cognostics could be displayed and interacted with in the viewer through a network graph. We also envision more options for how the quantitative and qualitative cognostics are interacted with, including more complex visual interactions with multiple cognostic variables. We have given consideration to the idea of using more modern web-based technologies to create and render the panel images. This could bring a new level of interactivity. However, we feel there is sufficient power (and ease of use) in allowing analysts to stick to the tools they are familiar with to create the panel-level plots, while still being able to have rich interactions with these plots at the cognostics level.

ACKNOWLEDGEMENTS

This work was supported in part by the Pacific Northwest National Laboratory Signature Discovery Initiative, the Future Power Grid Initiative, and the grant "D&R for Large Complex Data" from the DARPA XDATA Program.

REFERENCES

[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29.
[2] R. A. Becker, W. S. Cleveland, and M.-J. Shyu. The visual design and control of trellis display. Journal of Computational and Graphical Statistics, 5(2).
[3] M. Bostock, V. Ogievetsky, and J. Heer. D³: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12).
[4] D. B. Carr, R. J. Littlefield, W. Nicholson, and J. Littlefield. Scatterplot matrix techniques for large N. Journal of the American Statistical Association, 82(398).
[5] W. Cleveland. The Collected Works of John W. Tukey: Graphics, volume 5. Chapman & Hall/CRC.
[6] W. S. Cleveland. Visualizing Data. Hobart Press, Chicago.
[7] G. Dennis Jr., B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane, R. A. Lempicki, et al. DAVID: Database for annotation, visualization, and integrated discovery. Genome Biology, 4(5):P3.
[8] T. et al. Belle II Technical Design Report.
[9] S. Guha. Computing Environment for the Statistical Analysis of Large and Complex Data. PhD thesis, Purdue University Department of Statistics.
[10] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland. Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1):53–67.
[11] S. Guha, R. P. Hafen, P. Kidwell, and W. S. Cleveland. Visualization databases for the analysis of large complex datasets. Journal of Machine Learning Research, 5.
[12] S. Guha, P. Kidwell, A. Barthur, W. S. Cleveland, J. Gerth, and C. Bullard. A streaming statistical algorithm for detection of SSH keystroke packets in TCP connections. In R. K. Wood and R. F. Dell, editors, Operations Research, Computing, and Homeland Defense. Institute for Operations Research and the Management Sciences.
[13] R. Hafen. datadr GitHub documentation.
[14] R. Hafen. trelliscope GitHub documentation.
[15] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[16] RStudio, Inc. shiny: Web Application Framework for R. R package.
[17] D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer Verlag.
[18] B.-J. M. Webb-Robertson, L. A. McCue, K. M. Waters, M. M. Matzke, J. M. Jacobs, T. O. Metz, S. M. Varnum, and J. G. Pounds. Combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from MS-based proteomics data. Journal of Proteome Research, 9(11).
[19] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[20] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer.
[21] L. Wilkinson, A. Anand, and R. Grossman. High-dimensional visual analytics: Interactive exploration guided by pairwise views of point distributions. IEEE Transactions on Visualization and Computer Graphics, 12(6).
[22] B. Xi, H. Chen, W. S. Cleveland, and T. Telkamp. Statistical analysis and modeling of Internet VoIP traffic for network engineering. Electronic Journal of Statistics, 4:58–116, 2010.
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session IPS117) p.1852 RnavGraph: A visualization tool for navigating through high-dimensional data Waddell, Adrian University
Information Visualization WS 2013/14 11 Visual Analytics
1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and
Creating a universe on Hive with Hortonworks HDP 2.0
Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh
IBM SPSS Data Preparation 22
IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
Demographics of Atlanta, Georgia:
Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From
White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics
White Paper Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics Contents Self-service data discovery and interactive predictive analytics... 1 What does
Diagrams and Graphs of Statistical Data
Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in
Situational Awareness Through Network Visualization
CYBER SECURITY DIVISION 2014 R&D SHOWCASE AND TECHNICAL WORKSHOP Situational Awareness Through Network Visualization Pacific Northwest National Laboratory Daniel M. Best Bryan Olsen 11/25/2014 Introduction
InfiniteInsight 6.5 sp4
End User Documentation Document Version: 1.0 2013-11-19 CUSTOMER InfiniteInsight 6.5 sp4 Toolkit User Guide Table of Contents Table of Contents About this Document 3 Common Steps 4 Selecting a Data Set...
Novell ZENworks Asset Management 7.5
Novell ZENworks Asset Management 7.5 w w w. n o v e l l. c o m October 2006 USING THE WEB CONSOLE Table Of Contents Getting Started with ZENworks Asset Management Web Console... 1 How to Get Started...
Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
Introduction Course in SPSS - Evening 1
ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Structural Health Monitoring Tools (SHMTools)
Structural Health Monitoring Tools (SHMTools) Getting Started LANL/UCSD Engineering Institute LA-CC-14-046 c Copyright 2014, Los Alamos National Security, LLC All rights reserved. May 30, 2014 Contents
DataPA OpenAnalytics End User Training
DataPA OpenAnalytics End User Training DataPA End User Training Lesson 1 Course Overview DataPA Chapter 1 Course Overview Introduction This course covers the skills required to use DataPA OpenAnalytics
Contents WEKA Microsoft SQL Database
WEKA User Manual Contents WEKA Introduction 3 Background information. 3 Installation. 3 Where to get WEKA... 3 Downloading Information... 3 Opening the program.. 4 Chooser Menu. 4-6 Preprocessing... 6-7
SPSS: Getting Started. For Windows
For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 Introduction to SPSS Tutorials... 3 1.2 Introduction to SPSS... 3 1.3 Overview of SPSS for Windows... 3 Section 2: Entering
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
A Web-based Interactive Data Visualization System for Outlier Subspace Analysis
A Web-based Interactive Data Visualization System for Outlier Subspace Analysis Dong Liu, Qigang Gao Computer Science Dalhousie University Halifax, NS, B3H 1W5 Canada [email protected] [email protected] Hai
How to Get More Value from Your Survey Data
Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2
Getting Started with the ArcGIS Predictive Analysis Add-In
Getting Started with the ArcGIS Predictive Analysis Add-In Table of Contents ArcGIS Predictive Analysis Add-In....................................... 3 Getting Started 4..............................................
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Tableau Your Data! Wiley. with Tableau Software. the InterWorks Bl Team. Fast and Easy Visual Analysis. Daniel G. Murray and
Tableau Your Data! Fast and Easy Visual Analysis with Tableau Software Daniel G. Murray and the InterWorks Bl Team Wiley Contents Foreword xix Introduction xxi Part I Desktop 1 1 Creating Visual Analytics
Figure 1. An embedded chart on a worksheet.
8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the
Using an In-Memory Data Grid for Near Real-Time Data Analysis
SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses
