Bigger data analysis. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. Thursday, July 18, 2013

1 Bigger data analysis. Hadley Wickham (@hadleywickham), Chief Scientist, RStudio. July 2013

2 1. What is data analysis? 2. Transforming data 3. Visualising data

3 What is data analysis?

4 Data analysis is the process by which data becomes understanding, knowledge and insight

5 Data analysis is the process by which data becomes understanding, knowledge and insight

6 Visualise Tidy Transform Model

7 Frequent data analysis learn to program

8 Cognition time Computation time

9 Visualise: ggplot2. Tidy: reshape2, stringr, lubridate. Transform: plyr. Model.

10 Computation time Cognition time

11 Visualise: bigvis. Tidy. Transform: dplyr. Model.

12 Data. Every commercial US flight: ~76 million flights. Total database: ~11 Gb. >100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.

13 Transformation

14 Split, Apply, Combine (diagram)
Input (name, n): Al 2; Bo 4; Bo 0; Bo 5; Ed 5; Ed 10
Split by name: Al {2}; Bo {4, 0, 5}; Ed {5, 10}
Apply (total of n): Al 2; Bo 9; Ed 15
Combine (name, total): Al 2; Bo 9; Ed 15
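The split-apply-combine pattern on this slide can be sketched in base R. This is only an illustration using the toy values from the diagram; it is not how plyr or dplyr implement the strategy, and the variable names (`df`, `pieces`, `totals`, `result`) are made up for the example.

```r
# Toy data from the diagram: one row per observation.
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one data frame per name.
pieces <- split(df, df$name)

# Apply: reduce each piece to a one-row summary.
totals <- lapply(pieces, function(piece) {
  data.frame(name = piece$name[1], total = sum(piece$n),
             stringsAsFactors = FALSE)
})

# Combine: stack the per-group results back into one data frame.
result <- do.call(rbind, totals)
rownames(result) <- NULL
print(result)
```

Each of the three steps is a separate, inspectable object here, which is the point of the diagram: the grouping, the per-group computation, and the reassembly are independent concerns.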

15 plyr functions by input (rows) and output (columns)
Input \ Output       array   data frame   list    nothing
array                aaply   adply        alply   a_ply
data frame           daply   ddply        dlply   d_ply
list                 laply   ldply        llply   l_ply
n replicates         raply   rdply        rlply   r_ply
function arguments   maply   mdply        mlply   m_ply


17 (Figure: how often each plyr function is used, on a scale of Never, Occasionally, Often, All the time. Functions shown: a_ply, alply, aaply, l_ply, daply, adply, laply, d_ply, llply, dlply, ldply, ddply, count.)

18 Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows

19 Data analysis verbs + group by
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows
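To make the five verbs and "group by" concrete, here is a rough base R equivalent of each. The `flights` data frame and its columns are hypothetical stand-ins, and the code deliberately avoids dplyr so that it does not depend on any particular version of that package's API.

```r
# Hypothetical miniature flights table.
flights <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA", "WN"),
  delay    = c(10, -5, 30, 0, 15),
  distance = c(500, 1200, 800, 300, 650),
  stringsAsFactors = FALSE
)

# select: subset variables
selected <- flights[, c("carrier", "delay")]

# filter: subset rows
late <- flights[flights$delay > 0, ]

# mutate: add new columns
flights$delay_hours <- flights$delay / 60

# summarise: reduce to a single row
overall <- data.frame(mean_delay = mean(flights$delay), n = nrow(flights))

# arrange: re-order the rows
by_delay <- flights[order(flights$delay), ]

# group by + summarise: one summary row per carrier
per_carrier <- aggregate(delay ~ carrier, data = flights, FUN = mean)
print(per_carrier)
```

The base R forms all work, but each uses a different idiom (indexing, `$<-`, `order`, `aggregate`); the appeal of a small set of verbs is that every operation takes a data frame first and returns a data frame.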

20
h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial
library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"), summarise, n = length(Year))
# user system elapsed
#
count(h, c("Year", "Month", "DayofMonth"))
# user system elapsed
#

21
# Often work with the same grouping variables
# multiple times, so define upfront. Also refer
# to variables in the same way
daily_df <- group_by(h, Year, Month, DayofMonth)
# Now summarise knows how to deal with grouped
# data frames
summarise(daily_df, n())
# user system elapsed
#
# 20x faster!

22
library(data.table)
h_dt <- data.table(h)
daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
# user system elapsed
#
# Exactly the same syntax, but 2.5x faster!
# Don't need to learn the idiosyncrasies of
# data.table; just 2 lines of code

23
# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")
h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
# user system elapsed
#
# user system elapsed
#
# Much slower, but not restricted to a predefined subset
# Could speed up by carefully crafting indices

24
# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")
translate_sql(Year > 2005, ontime)
# <SQL> Year >
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005
translate_sql(Origin == "IAD" | Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'
years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)

25 Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ... MonetDB (planned)
Google BigQuery (bigrquery)

26
daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())
daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
# It doesn't matter how your data is stored

27
# It might even live on the web
library(dplyr)
library(bigrquery)
h_bq <- source_bigquery(billing_project, "ontime", "houston")
daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds
# Storage = $80 / TB / Month
# Query = $35 / TB (100 GB free)

28 dplyr
Currently experimental and incomplete, but it works, and you're welcome to try it out.
library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")
Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites)

29 Google for: split apply combine dplyr

30 Visualisation

31
library(ggplot2)
library(bigvis)
# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60
# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA

32 qplot(dist, speed, colour = delay) + scale_colour_gradient2()

33 One hour later... qplot(dist, speed, colour = delay) + scale_colour_gradient2()

34 x <- runif(2e5) y <- runif(2e5) system.time(plot(x, y))

35

36 user system elapsed

37 Goals
Support exploratory analysis (e.g. in R)
Fast on commodity hardware
100,000,000 in <5s
10^8 obs = 0.8 Gb, ~20 vars in 16 Gb

38 Insight
Bottleneck is number of pixels: 1d: 3,000; 2d: 3,000,000
Process: Condense (bin & summarise), Smooth, Visualise

39 Bin: floor((x - origin) / width)

40 Summarise
Count: Histogram, KDE
Mean: Regression, Loess
Std. dev.
Quantiles: Boxplots, Quantile regression smoothing
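The condense step (bin, then summarise) can be sketched in base R as follows. This is an illustration of the idea only, not the bigvis implementation; it assumes fixed-width bins indexed by floor((x - origin) / width), and the simulated vectors `x` and `z` are hypothetical stand-ins for variables such as dist and speed.

```r
set.seed(1)
x <- runif(1e5, min = 0, max = 1000)  # hypothetical stand-in for dist
z <- x / 2 + rnorm(1e5, sd = 10)      # hypothetical stand-in for speed

origin <- 0
width  <- 10

# Bin: map each value to an integer bin index.
bin_id <- floor((x - origin) / width)

# Summarise: one row per occupied bin, holding the count and the mean of z.
condensed <- data.frame(
  bin_start = sort(unique(bin_id)) * width + origin,
  count     = as.vector(table(bin_id)),
  mean_z    = as.vector(tapply(z, bin_id, mean))
)

head(condensed)
```

However many observations go in, the condensed table has at most one row per bin, so every later step (smoothing, plotting) works on a few hundred rows rather than on the raw data.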

41
dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
(Plot: count against dist.)

42 (Same code and plot as slide 41, with timing output overlaid: user, system, elapsed.)

43
time_s <- condense(bin(time, 1))
autoplot(time_s)
(Plot: count against time, including an NA bin.)

44
autoplot(time_s, na.rm = TRUE)
(Plot: count against time.)

45
autoplot(time_s[time_s < 500, ])
(Plot: count against time.)

46
autoplot(time_s %% 60)
(Plot: count against time.)

47 (Plot: speed against dist, coloured by count.)

48
sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")
(Plot: speed against dist, coloured by count.)

49 (Same code and plot as slide 48, with timing output overlaid: user, system, elapsed.)

50 (Plot: speed against dist, coloured by count.)

51
sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
(Plot: speed against dist, coloured by count.)

52 (Same code and plot as slide 51, with timing output overlaid: user, system, elapsed.)
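The two-dimensional case works the same way: bin each variable separately, then count observations per pair of bins. The sketch below uses base R and simulated data; the vectors `dist_sim` and `speed_sim` and the width of 20 are illustrative assumptions, and this is not the bigvis code path.

```r
set.seed(2)
dist_sim  <- runif(5e4, min = 0, max = 2000)  # hypothetical distances
speed_sim <- runif(5e4, min = 0, max = 800)   # hypothetical speeds

width <- 20

# Bin each variable to the lower edge of its bin.
dist_bin  <- floor(dist_sim / width) * width
speed_bin <- floor(speed_sim / width) * width

# Count observations in each (dist, speed) cell, keeping occupied cells only.
cells <- as.data.frame(table(dist = dist_bin, speed = speed_bin),
                       stringsAsFactors = FALSE)
cells <- cells[cells$Freq > 0, ]
cells$dist  <- as.numeric(cells$dist)
cells$speed <- as.numeric(cells$speed)

head(cells)
```

The number of rows in `cells` is bounded by the number of grid cells rather than by the number of observations, which is what makes a heatmap of counts cheap to draw.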

53 Demo
shiny::runApp("mt/", 8002)

54 Google for: bigvis

55 Conclusions

56 Visualise: bigvis. Tidy. Transform: dplyr. Model.


More information

Tushar Joshi Turtle Networks Ltd

Tushar Joshi Turtle Networks Ltd MySQL Database for High Availability Web Applications Tushar Joshi Turtle Networks Ltd www.turtle.net Overview What is High Availability? Web/Network Architecture Applications MySQL Replication MySQL Clustering

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

Introduction to SQL for Data Scientists

Introduction to SQL for Data Scientists Introduction to SQL for Data Scientists Ben O. Smith College of Business Administration University of Nebraska at Omaha Learning Objectives By the end of this document you will learn: 1. How to perform

More information

Introducing DocumentDB

Introducing DocumentDB David Chappell Introducing DocumentDB A NoSQL Database for Microsoft Azure Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Why DocumentDB?... 3 The DocumentDB Data Model...

More information

Connecting Software. CB Mobile CRM Windows Phone 8. User Manual

Connecting Software. CB Mobile CRM Windows Phone 8. User Manual CB Mobile CRM Windows Phone 8 User Manual Summary This document describes the Windows Phone 8 Mobile CRM app functionality and available features. The document is intended for end users as user manual

More information

OPTIMIZATION OF DATABASE STRUCTURE FOR HYDROMETEOROLOGICAL MONITORING SYSTEM

OPTIMIZATION OF DATABASE STRUCTURE FOR HYDROMETEOROLOGICAL MONITORING SYSTEM OPTIMIZATION OF DATABASE STRUCTURE FOR HYDROMETEOROLOGICAL MONITORING SYSTEM Ph.D. Robert SZCZEPANEK Cracow University of Technology Institute of Water Engineering and Water Management ul.warszawska 24,

More information

Data mining as a tool of revealing the hidden connection of the plant

Data mining as a tool of revealing the hidden connection of the plant Data mining as a tool of revealing the hidden connection of the plant Honeywell AIDA Advanced Interactive Data Analysis Introduction What is AIDA? AIDA: Advanced Interactive Data Analysis Developped in

More information

Report Builder. Microsoft SQL Server is great for storing departmental or company data. It is. A Quick Guide to. In association with

Report Builder. Microsoft SQL Server is great for storing departmental or company data. It is. A Quick Guide to. In association with In association with A Quick Guide to Report Builder Simon Jones explains how to put business information into the hands of your employees thanks to Report Builder Microsoft SQL Server is great for storing

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Big Analytics in the Cloud. Matt Winkler PM, Big Data @ Microsoft @mwinkle

Big Analytics in the Cloud. Matt Winkler PM, Big Data @ Microsoft @mwinkle Big Analytics in the Cloud Matt Winkler PM, Big Data @ Microsoft @mwinkle Part 3: Single Slide JustGiving is a global online social platform for giving that lets you raise money for a cause you care about

More information

Conquer the 5 Most Common Magento Coding Issues to Optimize Your Site for Performance

Conquer the 5 Most Common Magento Coding Issues to Optimize Your Site for Performance Conquer the 5 Most Common Magento Coding Issues to Optimize Your Site for Performance Written by: Oleksandr Zarichnyi Table of Contents INTRODUCTION... TOP 5 ISSUES... LOOPS... Calculating the size of

More information

Exploratory Data Analysis for Ecological Modelling and Decision Support

Exploratory Data Analysis for Ecological Modelling and Decision Support Exploratory Data Analysis for Ecological Modelling and Decision Support Gennady Andrienko & Natalia Andrienko Fraunhofer Institute AIS Sankt Augustin Germany http://www.ais.fraunhofer.de/and 5th ECEM conference,

More information

5 Correlation and Data Exploration

5 Correlation and Data Exploration 5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both

More information

Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK sales@sawmill.co.uk tel: +44 845 250 4470

Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK sales@sawmill.co.uk tel: +44 845 250 4470 Product Guide What is Sawmill Sawmill is a highly sophisticated and flexible analysis and reporting tool. It can read text log files from over 800 different sources and analyse their content. Once analyzed

More information

6 Steps to Faster Data Blending Using Your Data Warehouse

6 Steps to Faster Data Blending Using Your Data Warehouse 6 Steps to Faster Data Blending Using Your Data Warehouse Self-Service Data Blending and Analytics Dynamic market conditions require companies to be agile and decision making to be quick meaning the days

More information

Data Visualization in R

Data Visualization in R Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto Oct, 2014 Introduction Motivation for Data Visualization Humans are outstanding at detecting

More information

Predictive Analytics

Predictive Analytics Predictive Analytics How many of you used predictive today? 2015 SAP SE. All rights reserved. 2 2015 SAP SE. All rights reserved. 3 How can you apply predictive to your business? Predictive Analytics is

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

INFORMATION BROCHURE Certificate Course in Web Design Using PHP/MySQL

INFORMATION BROCHURE Certificate Course in Web Design Using PHP/MySQL INFORMATION BROCHURE OF Certificate Course in Web Design Using PHP/MySQL National Institute of Electronics & Information Technology (An Autonomous Scientific Society of Department of Information Technology,

More information

User Guide. Analytics Desktop Document Number: 09619414

User Guide. Analytics Desktop Document Number: 09619414 User Guide Analytics Desktop Document Number: 09619414 CONTENTS Guide Overview Description of this guide... ix What s new in this guide...x 1. Getting Started with Analytics Desktop Introduction... 1

More information