Using APSIM, C# and R to Create and Analyse Large Datasets
21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015

J. L. Fainges
CSIRO Agriculture, Australia
Email: [email protected]

Abstract: With the advent of cheap cluster computing, advanced models and improved analytical techniques, scientists are able to explore larger problem spaces than ever before. However, with this power comes added complexity. Large factorial simulations need to be designed and created, and the output, which can be many hundreds of gigabytes or more, needs to be analysed and presented in a meaningful way. This paper details one such scenario using the Agricultural Production Systems Simulator (APSIM) (Holzworth et al., 2014). A large factorial of simulations was created from five base simulations, generating approximately 100GB of output data. This output data was loaded into R using a new package designed for loading, testing, manipulating and exporting both input and output APSIM formatted data.

This paper explores the mechanics of large scale APSIM simulations, including how to leverage the APSIM User Interface (UI) to quickly build a starting point simulation, how to use the XML libraries in the .NET framework (C# in this case) to easily duplicate, replace and modify structures in the base simulation to create a large factorial, and finally methods to analyse large output data sets. The techniques demonstrated here are scalable; while a far smaller number of simulations was generated for this project, the same method could be applied to a grid analysis with millions of simulations. Similarly, the processing in R will work with any size data set so long as the computer used has enough memory. Even the language used for generating the simulations can be changed: C# is used here, but any language with a robust XML interpreter could be used. In order to keep the analysis run time down and simplify factorial combination generation, the top level factor (crop type) was used as a divider to split the output into individual batches that were then processed in R.

APSIM writes a single output file for each simulation. As such, large factorial simulations can produce millions of individual files that need to be processed. To assist with this, an R package has been created and is available on CRAN (under the package name APSIM) that automates the loading of multiple APSIM output files into a single R data frame or data table (Dowle et al., 2014). This package is able to handle single output and factorial simulations as well as import constants as separate columns. Additionally, extra utilities have been added that simplify the process of creating meteorological data files for simulation input.

R has many ways of doing the same thing, some of which are more efficient than others. A number of processes for reading and parsing data were evaluated and the most efficient methods were incorporated into the APSIM package. This evaluation has applications beyond APSIM, as the methods will be useful to anyone importing large quantities of data into R. The package assumes that any data sets loaded will fit into system memory. As simulations become more complex this assumption will become less valid and new approaches will need to be taken, such as the use of databases to store data that is not being actively worked on. There are R packages already in existence that do this, and the APSIM package will be expanded over time to utilise these options.

Optimisations used include minimising file access, binding files only after reading all of them, using dedicated data reading packages and applying techniques that reduce the amount of memory management required. The optimised data reader was able to import and bind APSIM output data at speeds in excess of 11MB/s for 5MB data files.

Keywords: APSIM, R, programming, XML, grid
1. INTRODUCTION

With the recent advent of large scale, cheap cluster computing solutions, scientists working in the field of modelling and simulation are able to model scenarios on scales that were unprecedented only a few years ago. The availability of commodity servers run by external providers means that computing power that was once accessible only to the largest research institutions is now in the hands of smaller groups and even individual researchers. Simulations that may have taken months to complete can now be run in days or hours. Large models with high memory requirements that would have needed to queue on a large cluster system can now be rented from cloud service providers for a few dollars per hour.

There is, however, a downside. The ability to run more accurate models faster and at higher resolution results in a corresponding increase in simulation output. Where researchers may once have dealt with outputs in the hundreds of megabytes, this explosion in computing power means that datasets in the hundreds of gigabytes or even terabyte range are not uncommon. This exponential rise in data production, known as big data, is changing both the commercial and scientific landscapes. As such, the tools and techniques that were once used for analysis are no longer suited to the task and new approaches must be found.

This paper looks at one such scenario, where a large factorial of simulations was generated from five base simulations that were built by hand. APSIM provides considerable flexibility in specifying how much output is generated; however, this project needed daily output over 60 years for a number of variables, so the raw output of these runs was around 100 gigabytes (5.5MB per file). While the simulator used here was the Agricultural Production Systems Simulator (APSIM), the described method can be used with any simulator that has textual input and output. Similarly, although the simulations were generated using the C# programming language, the method is language agnostic and can scale to any number of simulations.

2. CREATING THE BASE SIMULATIONS

The first step in any large scale simulation set is to plan all factors and levels. The purpose of the demonstration analysis used here was to visualise the photothermal quotient (Silva et al., 2014) across Australia at different time periods for different crops and cultivars. The factors used were crop, cultivar, sowing time and site. There were five crops, three cultivars (nine were used for wheat to better simulate the geographic segregation of varieties), five sowing dates and 660 sites around Australia. The sites were taken from the ApSoil database (ApSoil, 2013).

Figure 1. Factor hierarchy (5 crops, 3 cultivars, 5 sowing dates, 660 sites) that generated the factor combinations. Extra cultivars were used for wheat, increasing the total number of simulations.

Once the problem domain has been specified and appropriate factors planned, a number of base simulations should be prepared that capture all information that will remain static across factor levels. The method used for this analysis involved creating one complete simulation for each level of the highest factor. While not strictly necessary, this approach was taken to simplify processing during the analysis. Splitting a large factorial analysis into smaller batches also has benefits for processing runtime, as will be shown below. One base simulation was built by hand for each level of the highest factor, which in this case was the crop.
The base simulations contain everything needed for a single combination. To ensure they are complete, a combination was built as though it were a standalone simulation and run to check that it completed successfully. As five base simulations were needed for the five crops, this simulation was then copied and modified to run each of the crops. Note that the only important requirement here is that the base simulations run successfully; the actual output does not matter. Since the base simulations will not form part of the factorial, no effort was made to ensure that they were spatially correct. For instance, weather and soil data were not parameterised for any of the sites being modelled; they were simply left at their defaults.
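For reference, the factor space described above can be enumerated in R to confirm its size. The sketch below is illustrative only; the crop and cultivar names are placeholders, as only wheat is named in the paper.

crops <- c("Wheat", "Crop2", "Crop3", "Crop4", "Crop5")  # placeholder names
nCultivars <- ifelse(crops == "Wheat", 9, 3)             # nine cultivars for wheat, three otherwise

combos <- do.call(rbind, lapply(seq_along(crops), function(i)
  expand.grid(Crop     = crops[i],
              Cultivar = paste0("cv", seq_len(nCultivars[i])),
              SowDate  = paste0("sow", 1:5),             # five sowing dates
              Site     = sprintf("site%03d", 1:660),     # 660 ApSoil sites
              stringsAsFactors = FALSE)))

nrow(combos)  # 69300 factor combinations: (4 crops x 3 cultivars + 9 for wheat) x 5 dates x 660 sites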
3. CREATING THE FACTORIALS

An APSIM input simulation file is a simple XML structured text file.

<folder version="36" creator="apsim 7.7-r3601" name="simulations">
  <simulation name="wheaxe">
    <metfile name="met">
      <filename name="filename" input="yes">8001.met</filename>
    </metfile>
    <clock>
      <start_date type="date" description="enter the start date of the simulation">01/01/2005</start_date>
      <end_date type="date" description="enter the end date of the simulation">31/12/2013</end_date>
    </clock>
    <summaryfile />
    ...
  </simulation>
</folder>

Figure 2. A sample from an APSIM input file showing the XML structure.

Once the base simulations were running, they were modified so that a C# program could replace tokens in the file with actual factorial parameters. Where the factor to replace was a complete XML element, no change was made; the C# XML engine replaces the entire element. Where the parameter is part of an element, a token was manually inserted, as shown in Figure 3.

[EventHandler]
public void OnInitialised()
{
    reportgs31 = (Outputfile) MyPaddock.LinkByName("reportGS31");
    reportgs65 = (Outputfile) MyPaddock.LinkByName("reportGS65");
    town = "@TOWN";
    trial = "@TRIAL";
}

Figure 3. A portion of an XML element (C# script) where placeholder tokens (@TOWN and @TRIAL) are being used.

While either method can be used alone, this hybrid approach strikes a good compromise between speed and simplicity. Using only the XML element replacement method means that large elements need to be repeated in their entirety, which can be cumbersome for elements such as manager scripts (especially where two or more parameters need to be changed, as above), while using only the token replacement approach means that each parameter needs to be manually modified to accept a token, which can be slow when there is a large number of parameters.

Once the base simulations were prepared, a C# program was used to read in all factor combinations and create simulations by replacing tokens and elements. Using the XML functions in C#, it becomes very easy to replace individual elements or even entire blocks. In this case there were five tokens and seven elements to be replaced, one of which (the soil) had multiple child nodes. The tokens were replaced first by converting the entire XML document to a string, doing the replacements, then converting the string back to an XML document. Next, the elements were replaced by retrieving each element from the document and changing its value. Since the element value is passed by reference, there is no need to write it back to the document. The process for changing the most complex element, the soil, is shown in Figure 4.
// Get the soil from the simulation using built in APSIM XML functions.
// These are standard .NET XML functions customised for APSIM structures.
XElement Soil = Simulation.Descendants("Soil").ToList()[0];

// Get a new Soil from the ApSoil database (read elsewhere).
XElement NewSoil = XElement.Parse((string)Row.Cells["SoilXML"].Value);

// Set the name of the element and replace the old one.
NewSoil.FirstAttribute.Value = "Soil";
Soil.ReplaceWith(NewSoil);

Figure 4. The code used to replace a multi-child XML node.

These four lines are all that is needed to replace any element in the simulation. Even a complex multi-child node such as the soil is very simple to replace. Similar code can be used in any language with an XML parser.

4. ANALYSING THE DATA SET

The purpose of the analysis was to create maps showing the intensity of a number of variables at various sites across Australia. 660 sites from the ApSoil database (ApSoil, 2013) were analysed and the data was interpolated to create the final graphic. The plotted region is the Australian wheat belt (Zheng et al., 2013).

Figure 5. An example of one of the maps created for the analysis showing the photothermal quotient, which is the ratio of solar radiation to mean temperature, for a crop. Units are MJ m⁻² d⁻¹ °C⁻¹.

Each graph was generated from approximately 4GB of raw data. There were 5 crops and each crop had 5 graphs, for a total of 100GB of simulation output data.

4.1 Preparing the simulation output

As was mentioned, the data was split by crop into batches and each batch was analysed individually with its own R script. While the entire data set could have been processed in a single batch, having multiple smaller batches made for simpler scripts and faster runtimes due to the way R handles memory management (Sridharan et al., 2015). Data was loaded into R as a data frame, which is an object that can contain multiple rows and columns of different data types, similar to an Excel table or a group of vectors where each vector is of one data type and all have the same length. Note that they are not a list of vectors, as a list is a different structure in R.

Analysis run time for each crop was approximately 24 hours. Most of this time was spent in two lines that used the ddply function in the plyr package (Wickham, 2011) to split the initial very large (~60 million row) data frame into individual combinations and then run a function on each. First, the data is split into groups according to one or more variables, similar to an Excel filter or SQL GROUP BY statement. Each group is then treated as a separate data frame that has a user defined function applied to it. Finally, the results of each grouped calculation are bound together into a final data frame. plyr was splitting and combining a very large number of combinations, and the amount of memory management this required resulted in very inefficient processing. This was later reduced to around 2 hours by splitting the dataset before processing, looping over the other factor levels, and then joining the resultant data frames once all loops were complete. By reducing the number of combinations in each call to ddply, extra memory management is avoided, which drastically improves run time.
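A minimal sketch of this restructuring is shown below. It assumes one crop batch has already been loaded into a data frame cropData with factor columns Cultivar, SowDate and Site and a reported variable ptq; all of these names are illustrative rather than taken from the project scripts.

library(plyr)        # ddply (Wickham, 2011)
library(data.table)  # rbindlist

# Hypothetical per-combination summary; ptq stands in for the reported variable.
summariseCombo <- function(d) data.frame(meanPTQ = mean(d$ptq, na.rm = TRUE))

# Original approach: one ddply call over every factor combination (slow).
# result <- ddply(cropData, c("Cultivar", "SowDate", "Site"), summariseCombo)

# Faster approach: split on one factor first, run ddply on each smaller piece,
# then join the resultant data frames in a single rbindlist call.
pieces <- split(cropData, cropData$SowDate)
result <- rbindlist(lapply(pieces, function(d)
  ddply(d, c("Cultivar", "Site"), summariseCombo)))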
4.2 Creating the APSIM Package

APSIM output files are not standard tabulated data. Figure 6 shows that the actual data is preceded by a number of constants, units and headings.

ApsimVersion = 7.7
Title = HGriffith
Date          biomass   yield
(dd/mm/yyyy)  (kg/ha)   (kg/ha)
10/11/...     ...       ...
...

Figure 6. The first few lines from an APSIM output file showing extra data before the simulation output.

As such, the usual data import functions such as read.table will not work. Additionally, the units and header rows must be read separately, and the number of constants (key = value above) can also change. The development of the loadApsim function went through a few iterations to address these conditions and optimise for the best speed.

While the number of constants and the line where the data starts are unknown, the order in which they appear is not. Constants come first, then the header row, units and finally data, and constants will always contain an equals (=) sign. In the first iteration of the function, the entire file was read line by line. A cascading if..else statement was used to capture the correct data in each line. When the output table was reached, it would continue reading line by line and bind everything together at the end. This process turned out to be very slow as it did not use the optimisations available when reading large blocks of data. Since the constants and header information are a very small portion of the entire output file, a second approach was taken where the file was read line by line as before until the start of the tabular data was reached. A counter kept track of this point and read.table was then used with the skip parameter to read the rest of the file. Constants were then bound to the data frame as columns, each data frame was added to a list, and data.table::rbindlist was used to create a final bound data frame after all files had been read. This process sped up the reading of multiple files considerably, especially when the files were large. A file handle was kept open for this process to avoid the overhead of opening and closing files for each read operation.
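The second approach can be sketched for a single file as follows. This is a simplified illustration assuming the layout in Figure 6 (constants, header row, units row, then data); it is not the package's actual implementation, and for brevity it reopens the file with read.table rather than keeping a file handle open.

# Sketch of the two-phase read: scan the small header region, then bulk-read the data.
read_apsim_out <- function(file) {
  lines   <- readLines(file, n = 50)              # the header region is small
  isConst <- grepl("=", lines, fixed = TRUE)      # constants are "key = value" lines
  nConst  <- sum(isConst)                         # constants come first, so this is also a line counter
  headers <- strsplit(lines[nConst + 1], " +")[[1]]
  headers <- headers[headers != ""]
  # Skip the constants, header row and units row, then read the data block in one call.
  dat <- read.table(file, skip = nConst + 2, header = FALSE,
                    col.names = headers, stringsAsFactors = FALSE)
  # Bind each constant to the data frame as an extra column.
  for (cline in lines[isConst]) {
    kv <- strsplit(cline, "=", fixed = TRUE)[[1]]
    dat[[trimws(kv[1])]] <- trimws(kv[2])
  }
  dat
}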
An important point to note is that each individual file was read into a list of data frames first. Binding the data frames one at a time using rbind (which works by taking all the rows of one data frame and appending them to the bottom of another) is both computationally and memory intensive due to the amount of checking and resizing that R needs to do (Wickham, 2014). Since all the data is already loaded into the list, we know how much space to allocate to the final data frame. We can then use data.table::rbindlist and the optimisations provided by the data.table package to bind much faster.
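The two binding strategies can be contrasted with a short sketch; the output directory name and the read_apsim_out helper from the previous sketch are assumptions, not part of the package.

library(data.table)

files <- list.files("output", pattern = "\\.out$", full.names = TRUE)

# Slow: grows the result one file at a time, forcing R to copy and resize repeatedly.
# allData <- data.frame()
# for (f in files) allData <- rbind(allData, read_apsim_out(f))

# Fast: read every file into a list first, then bind once with rbindlist.
allData <- rbindlist(lapply(files, read_apsim_out), fill = TRUE)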
4.3 Benchmarking and Optimisations

A number of benchmarks were run during the writing of the APSIM output reader to explore different methods and identify the most efficient. The original pseudocode for the function is shown in Figure 7:

For each file to be read:
    While NOT end of file, read a line:
        Trim all white space
        If the string "factors = " is present:
            # This indicates the output files came from a factorial APSIM
            # run and the user has chosen the 'Use a single line' option
            # for factor level output.
            Split the line by ';' to separate the factors
            Unlist and add each vector to constants
        Else if the line contains '=' it is a constant:
            Split the line by '=' and add to constants
        Else the next two lines will be column headers and units:
            Split the lines, remove extra whitespace and store in vectors
            Everything after this is data, so store this line number and break from the while loop
    End While
    Use read.table to read in the data from the stored line number
End For
Add constants as columns
Add column names
Use data.table::rbindlist to merge all files into a single data table
Return data frame or table according to user choice

Figure 7. Pseudocode for the APSIM output reader.

The time this original iteration took to load one thousand 10KB files was used as the baseline. Table 1 shows the optimisations that were made.

Table 1. Effect of optimisations on run speed. Times are cumulative.

Optimisation                               Run Time (s)
Base (no optimisation)
unlist(, use.names = FALSE)
Remove trim (only trim when required)      7.78
strsplit(, fixed = TRUE)                   3.87

Removing the trim statement had the largest effect on runtime. It was determined that trimming every line was not required, as any extra elements caused by white space could be removed more efficiently by subsetting the resultant object. Moving to read.table for the data portion also negated the need for trimming each line. The next largest effect was that of splitting strings into elements. By default, strsplit will use a regular expression to do the splitting even if a literal character or string is given.

Table 2. Run times for various options in two string manipulation functions. Numbers in brackets are relative to the default options in the base package.

String Splitting Optimisation              strsplit {base}    str_split {stringr}
                                           Run Time (µs)      Run Time (µs)
Regex using literal space                  (1.00)             (1.20)
Regex using whitespace delimiter \s        (1.25)             (1.28)
Fixed search for literal space             4.73 (0.12)        (1.42)

More subtle optimisations include precompiling the function using the compiler::cmpfun function, which resulted in an overall 11% performance increase, and a 313% increase from using square bracket notation for subsetting data frames instead of the subset function when working with very large data frames.
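These micro-optimisations can be illustrated with a short, self-contained sketch; the function, column names and data sizes below are illustrative only.

library(compiler)   # cmpfun is part of base R's compiler package

line <- "Date biomass yield"

# 1. Fixed string splitting avoids the regular expression engine.
strsplit(line, " ")[[1]]                # default: the split pattern is treated as a regex
strsplit(line, " ", fixed = TRUE)[[1]]  # fixed = TRUE: plain substring match, much faster

# 2. Precompiling a hot function with compiler::cmpfun.
parseLine  <- function(x) strsplit(x, " ", fixed = TRUE)[[1]]
parseLineC <- cmpfun(parseLine)         # byte-compiled version, identical behaviour

# 3. Square bracket subsetting instead of subset() on very large data frames.
df <- data.frame(yield = rnorm(1e6), site = sample(1:660, 1e6, replace = TRUE))
# subset(df, site == 1)                 # convenient but slower on big data frames
df[df$site == 1, ]                      # square bracket form used in the package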
The final benchmarks for the optimised function are shown in Table 3. Benchmarks were run on an Intel Core i5 CPU at 1.9GHz with a solid state drive.

Table 3. Benchmark times for reading different numbers of files at different sizes. Numbers in brackets are sizes of files in bytes.

                  10K (10 077)       100K ( )           1M ( )             5M ( )
Number of files   Time  SD  MB/s     Time  SD  MB/s     Time  SD  MB/s     Time  SD  MB/s

5. CONCLUSION

The ability to generate large data sets brings with it the requirement to analyse that data in a timely manner. The first obstacle to analysis is importing the data into an analysis pipeline. The R package APSIM presented here provides a simple, fast and flexible interface for researchers to import their data in a format that is easy to manipulate and does not require pre-processing. It can read data at speeds in excess of 11MB/s (dependent on the size of individual data files and the speed of the CPU and storage) and provides functions for creating simulation input weather files. Large files read faster, so care should be taken to ensure that fewer, longer simulations are run where possible; for example, run long multi-year simulations with soil resets instead of multiple single-year runs. The R APSIM package is under continuous development and future releases will include the ability to read directly into databases for data sets that are too large to fit in memory. Support for APSIM Next Generation output is also planned.

REFERENCES

ApSoil (2013). "APSoil." Retrieved 05 July 2015.

Dowle, M., T. Short, S. Lianoglou, A. Srinivasan, R. Saporta and E. Antonyan (2014). data.table: Extension of data.frame. Retrieved 05 July 2015.

Holzworth, D. P., N. I. Huth, P. G. Devoil, E. J. Zurcher, N. I. Herrmann, G. McLean, K. Chenu, E. J. van Oosterom, et al. (2014). "APSIM - Evolution towards a new generation of agricultural systems simulation." Environmental Modelling & Software 62.

R Development Core Team (2011). R: A language and environment for statistical computing. Vienna, Austria, R Foundation for Statistical Computing.

Sadras, V. O., L. Lake, K. Chenu, L. S. McMurray and A. Leonforte (2012). "Water and thermal regimes for field pea in Australia and their implications for breeding." Crop & Pasture Science 63(1).

Silva, R. R., G. Benin, J. A. Marchese, E. D. B. da Silva and V. S. Marchioro (2014). "The use of photothermal quotient and frost risk to identify suitable sowing dates for wheat." Acta Scientiarum-Agronomy 36(1).

Sridharan, S. and J. M. Patel (2015). Profiling R on a Contemporary Processor. 41st International Conference on Very Large Data Bases. Hawai'i Island, Hawaii, VLDB Endowment. 8.

Wickham, H. (2011). "The Split-Apply-Combine Strategy for Data Analysis." Journal of Statistical Software 40(1).

Wickham, H. (2014). Memory. Advanced R, Chapman and Hall/CRC.

Zheng, B. Y., B. Biddulph, D. R. Li, H. Kuchel and S. Chapman (2013). "Quantification of the effects of VRN1 and Ppd-D1 to predict spring wheat (Triticum aestivum) heading time across diverse environments." Journal of Experimental Botany 64(12).