Hands-On Data Science with R
Dealing with Big Data

[email protected]

27th November 2014

Visit http://handsondatascience.com/ for more Chapters.

In this module we explore how to load larger datasets into R and present operations for processing larger data within R. Whilst we demonstrate some timing differences between approaches, the purpose of this chapter is to present efficient approaches for dealing with large datasets rather than a systematic comparison of alternatives.

The required packages for this module include:

library(data.table)   # Efficient data storage
library(dplyr)        # Efficient data manipulation
library(rattle)       # For normVarNames()
# library(rbenchmark)
# library(sqldf)

As we work through this chapter, new R commands will be introduced. Be sure to review each command's documentation and understand what the command does. You can ask for help using the ? command, as in:

?read.csv

We can obtain documentation on a particular package using the help= option of library():

library(help=rattle)

This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., within RStudio) and to run all the commands as they appear here. Check that you get the same output, and that you understand the output. Try some variations. Explore.

Copyright Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.
1 Loading Data From CSV

Loading data from a CSV file is one of the simplest ways to load data into R. The standard function to do so is read.csv(), but it has some inherently inefficient defaults. We can instead use fread() from data.table (Dowle et al., 2014) to load data into R more quickly. The time differences can be significant and, indeed, can be the difference between being able to load big data and not being able to. To illustrate we will use a small dataset, and so we expect to see only small absolute improvements in time, although the relative improvements are significant.

The dataset we use is the weatherAUS dataset from rattle (Williams, 2014), available as a CSV file.

fnwa <- file.path("data", "weatherAUS.csv")
cat(format(file.info(fnwa)$size, big.mark=","), "\n")

## 7,426,513

Loading the dataset from data/weatherAUS.csv using read.csv() takes about 1 second.

system.time(ds <- read.csv(file=fnwa))

##    user  system elapsed

dim(ds)

## [1] 66672    24

class(ds)

## [1] "data.frame"

Using fread() we see that the time taken is quite a bit less:

system.time(ds <- fread(input=fnwa))

##    user  system elapsed

dim(ds)

## [1] 66672    24

class(ds)

## [1] "data.table" "data.frame"

Loading the same dataset from data/weatherAUS.csv using fread() takes just 15% of the time required by read.csv(). If this translated through to larger datasets, then something that might take a day to load (e.g., 100 million observations of 1,000 variables) using read.csv() would take 4 hours (though generally less) using fread().
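To quantify the difference on your own hardware, the optional rbenchmark package (commented out in the setup above) can repeat both reads and report relative timings. This is a minimal sketch under the same assumptions as above (data/weatherAUS.csv exists); the actual numbers will vary from machine to machine.

library(rbenchmark)   # Optional: simple timing comparisons
library(data.table)

fnwa <- file.path("data", "weatherAUS.csv")

# Repeat each read a few times; the 'relative' column reports each
# approach as a multiple of the fastest.
benchmark(read.csv = read.csv(file=fnwa),
          fread    = fread(input=fnwa),
          replications = 3,
          columns = c("test", "replications", "elapsed", "relative"))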
2 Loading Big Data From CSV

Options for read.csv()

As datasets get larger, read.csv() becomes less useful for reading from file. When reading very much larger CSV files the times can be quite substantial. We can still use read.csv(), but with some options we can do better.

One suggestion is to tell read.csv() the number of rows that we wish to load, using nrows=. Even though we might set this to a value slightly larger than the number of rows in the dataset, it is useful to do so, since it seems to let R know not to allocate too much unnecessary memory. In general R will not know how much memory will be needed to load the dataset, and so it may attempt to allocate much more than actually required as it is processing the data. Without telling R an expected number of rows we can run out of memory on smaller memory machines. Setting it to a size slightly larger than the number of rows to be read is advised.

ds <- read.csv(file=fnbd, nrows=1.1e6)   # fnbd is the path to the larger CSV dataset

We can also slightly improve read.csv() load times by not converting strings to factors.

ds <- read.csv(file=fnbd, nrows=1.1e6, stringsAsFactors=FALSE)

Using fread() we don't need to take the above precautions, such as specifying a row count, since fread() is well tuned to loading large datasets fast.

ds <- fread(input=fnbd, showProgress=FALSE)

We can help read.csv() quite a bit more, and avoid a lot of extra processing and memory usage, by telling read.csv() the data types of the columns of the CSV file. We do this using colClasses=. Often, though, we are not all that sure what the classes should be, there may be many of them, and it would not be reasonable to have to specify each one manually. We can get a pretty good idea of the data types of the various columns by reading only a part of the CSV file. A common approach is to make use of nrows= to read a limited number of rows. From the first 5,000 rows, for example, we can extract the classes of the columns as automatically determined by R and then use that information to read the whole dataset. Note firstly the expected significant reduction in time in reading just the first 5,000 rows.

system.time(dsf <- read.csv(file=fnwa, nrows=5e3))

##    user  system elapsed

classes <- sapply(dsf, class)
classes

##          Date      Location       MinTemp       MaxTemp      Rainfall
##      "factor"      "factor"     "numeric"     "numeric"     "numeric"
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am
##     "numeric"     "numeric"      "factor"     "integer"      "factor"
...

Now we can read the full dataset without R having to do so much parsing:
system.time(dsf <- read.csv(file=fnwa, colClasses=classes))

##    user  system elapsed

For a small dataset, as used here for illustration, the timing differences are in fractions of a second. Here we use the larger dataset, and we again communicate the number of rows to avoid R seeking too much memory:

ds      <- read.csv(file=fnbd, nrows=5e3)
classes <- sapply(ds, class)
ds      <- read.csv(file=fnbd, nrows=1.1e6, colClasses=classes)

class(ds)
dim(ds)

For our 10GB dataset, with 2 million observations and 1,000 variables, this approach reduces the read time from 30 minutes to 10 minutes.
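The pattern above (read a sample, infer the column classes, then read the full file with those classes) is easily wrapped as a helper. The following is a minimal sketch; the function name read_csv_typed and its defaults are our own illustration, not part of any package.

# Infer colClasses from a sample of the file, then read the full file
# using those classes. nrows=-1 is read.csv()'s default, meaning all rows.
read_csv_typed <- function(file, sample.rows=5e3, nrows=-1, ...)
{
  sample  <- read.csv(file=file, nrows=sample.rows, ...)
  classes <- sapply(sample, class)
  read.csv(file=file, nrows=nrows, colClasses=classes, ...)
}

# For example, with the weatherAUS CSV from Section 1:
# ds <- read_csv_typed(file.path("data", "weatherAUS.csv"))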
3 Efficient Data Manipulation with dplyr

The dplyr (Wickham and Francois, 2014) package focuses on providing very efficient data manipulations over data frames and data tables. We will use our small data table to illustrate.

ds <- fread(input=fnwa)
setnames(ds, names(ds), normVarNames(names(ds)))

## Loading required package: stringr

The basic concept of the grammar defined by dplyr is to split the data, apply some operation to the groups, then combine the data. Here we first split the data using group_by():

g <- ds %>% group_by(location)
g

## Source: local data table [66,672 x 24]
## Groups: location
##
##    date location min_temp max_temp rainfall evaporation sunshine
##    (first 10 observations shown, all for Adelaide)
##
## Variables not shown: wind_gust_dir (chr), wind_gust_speed (int),
## wind_dir_9am (chr), wind_dir_3pm (chr), wind_speed_9am (int),
## wind_speed_3pm (int), humidity_9am (int), humidity_3pm (int),
## pressure_9am (dbl), pressure_3pm (dbl), cloud_9am (int), cloud_3pm
## (int), temp_9am (dbl), temp_3pm (dbl), rain_today (chr), risk_mm (dbl),
## rain_tomorrow (chr)

Notice the compact form for printing a human readable view of the dataset. For each group we might want to report the average daily rainfall:

g <- g %>% summarise(daily_rainfall = mean(rainfall, na.rm=TRUE))
g

## Source: local data table [46 x 2]
##
##   location daily_rainfall
## 1 Adelaide
...
We can of course combine this into one pipeline:

ds %>%
  group_by(location) %>%
  summarise(daily_rainfall = mean(rainfall, na.rm=TRUE))

## Source: local data table [46 x 2]
##
##   location daily_rainfall
## 1 Adelaide
...

We can use either format, depending on circumstance; both can be quite readable.
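The same grammar extends naturally to longer pipelines. As a sketch (the column names are those produced by normVarNames() above, and the 1mm threshold is an arbitrary illustration), we might restrict attention to wet days before summarising, then order the locations by their wet-day average:

ds %>%
  filter(rainfall > 1) %>%                  # Keep only wet days (> 1mm)
  group_by(location) %>%
  summarise(wet_day_rainfall = mean(rainfall, na.rm=TRUE),
            wet_days         = n()) %>%     # n() counts rows per group
  arrange(desc(wet_day_rainfall))           # Wettest locations first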
4 Archetypes Overview

We introduce here a new approach to handling big data for model building. The approach is called Archetypes.
5 Archetypes Phase 1: Cluster the Population

For example, cluster on missing values. Each cluster can then be interpreted as a behavioural group. We illustrate using wskm.
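The wskm code for this phase is not yet included in this draft. The following is a minimal sketch of the idea using base kmeans() on an indicator matrix of missingness; in practice a weighted k-means from wskm would take the place of kmeans().

library(rattle)   # Provides the weatherAUS dataset
data(weatherAUS)

# Represent each observation by its pattern of missing values:
# 1 where a value is missing, 0 otherwise.
miss <- as.data.frame(lapply(weatherAUS, function(x) as.integer(is.na(x))))

# Drop constant columns (never or always missing) to avoid degenerate input.
miss <- miss[, sapply(miss, var) > 0]

# Cluster the missingness patterns into a handful of behavioural groups.
set.seed(42)
cl <- kmeans(miss, centers=4)
table(cl$cluster)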
6 Archetypes Phase 2: Rule Induction to Nuggets

For example, build decision trees for each cluster and convert each tree to rules; each rule is then a nugget. We illustrate with wsrpart.
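As with Phase 1, the wsrpart code is not yet included; a minimal sketch with plain rpart (and assuming the clustering cl from the previous phase) builds one tree per cluster, whose root-to-leaf paths serve as candidate nuggets.

library(rpart)

# One decision tree per behavioural group; RainTomorrow stands in for
# the modelling target. Date is an identifier and RISK_MM leaks the
# outcome, so we drop both.
trees <- lapply(sort(unique(cl$cluster)), function(i)
{
  grp <- weatherAUS[cl$cluster == i, ]
  grp$Date    <- NULL
  grp$RISK_MM <- NULL
  rpart(RainTomorrow ~ ., data=grp)
})

# Each leaf of a tree corresponds to a rule (a nugget); path.rpart()
# recovers the conditions that lead to a given leaf.
trees[[1]]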
7 Archetypes Phase 3: Local Model Build per Nugget

Now build a predictive model for each nugget. If not a predictive model, then perhaps at least a formulation of the interestingness of the nugget. We illustrate with wsrf.
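Again as a sketch, with randomForest standing in for wsrf, and with each cluster's subset standing in for a nugget for brevity: we fit a local model on the complete cases of each group. Groups dominated by missing values may need imputation rather than na.omit().

library(randomForest)

# A local predictive model per group; in the full approach each nugget
# (rule) would define its own subset rather than the whole cluster.
models <- lapply(sort(unique(cl$cluster)), function(i)
{
  grp <- na.omit(weatherAUS[cl$cluster == i, ])
  grp$Date    <- NULL   # Drop the identifier column
  grp$RISK_MM <- NULL   # Drop the leaking risk variable
  grp$RainTomorrow <- as.factor(grp$RainTomorrow)   # Ensure classification
  randomForest(RainTomorrow ~ ., data=grp)
})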
8 Archetypes Phase 4: Global Model

Now build a single formula to combine the local models as they pertain to scoring new observations. We could use linear regression or genetic programming. We illustrate with rgp.
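A sketch of the simplest combination, using plain logistic regression rather than rgp's genetic programming: score a held-out sample with every local model from Phase 3 and learn a weighting over those scores. The held-out data must be prepared identically to the local models' training data.

# Prepare data once, matching the local-model training data.
full <- na.omit(weatherAUS)
full$Date    <- NULL
full$RISK_MM <- NULL
full$RainTomorrow <- as.factor(full$RainTomorrow)

set.seed(42)
holdout <- full[sample(nrow(full), 1000), ]

# Each column of scores is one local model's probability of rain.
scores <- sapply(models, function(m)
  predict(m, newdata=holdout, type="prob")[, "Yes"])

# A simple global combiner: logistic regression over the local scores.
global <- glm(holdout$RainTomorrow == "Yes" ~ scores, family=binomial)
summary(global)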
9 Working with Data Tables

Whereas we index a data frame by the observations (i) and the columns (j), the basic syntax of the data table takes a very different view, one that is more familiar to database users. In terms of the familiar SQL syntax, we might think of a data table access as specifying the condition to be met by the observations we wish to extract (a WHERE clause), the columns we wish to extract (a SELECT clause), and how the data is to be aggregated (a GROUP BY clause):

ds[WHERE, SELECT, by=GROUPBY]
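As a concrete sketch of this pattern, using the normalised variable names from Section 3: the first argument filters observations, the second computes over the selected columns, and by= groups.

# WHERE: wet days only; SELECT: average rainfall; GROUP BY: location.
ds[rainfall > 1, mean(rainfall, na.rm=TRUE), by=location]

# The SELECT can name its outputs by returning a list, much like
# SQL's SELECT ... AS; .N counts the rows in each group.
ds[rainfall > 1,
   list(wet_day_rainfall=mean(rainfall, na.rm=TRUE), wet_days=.N),
   by=location]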
10 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from togaware.com, including the Data Mining Desktop Survival Guide.

This chapter is one of many chapters available from HandsOnDataScience.com. In particular, follow the links on the website marked with a *, which indicate the generally more developed chapters.

Other resources include:

- A good overview of data.table is available through the useR!2014 tutorial.
- An extensive data.table cheat sheet is also available.
11 References

Dowle M, Short T, Lianoglou S, Srinivasan A, with contributions from Saporta R, Antonyan E (2014). data.table: Extension of data.frame. R package version 1.9.4. URL https://CRAN.R-project.org/package=data.table.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wickham H, Francois R (2014). dplyr: A Grammar of Data Manipulation. R package version 0.3. URL https://CRAN.R-project.org/package=dplyr.

Williams GJ (2009). Rattle: A Data Mining GUI for R. The R Journal, 1(2).

Williams GJ (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! Springer, New York.

Williams GJ (2014). rattle: Graphical User Interface for Data Mining in R. R package version 3.0.4. URL https://CRAN.R-project.org/package=rattle.

This document, sourced from BigDataO.Rnw revision 542, was processed by KnitR version 1.8 and took 3.6 seconds to process. It was generated by gjw on theano, running Ubuntu LTS with an Intel(R) Core(TM) i7-3517U at 1.90GHz (4 cores) and 3.9GB of RAM.
