Statistical Computing / Computational Statistics



Similar documents
Call Centers in Berlin

ZAV Germany a member of the European EURES-Network

The Regional Breakdown of the

Unit 5: Quer durch Deutschland. Auf geht's! Cultural Targets

Media Facts. BerlinOnline Stadtportal GmbH & Co. KG

Turin, October Mr. Betz, Tilman and Mrs. Bümmerstede, Julia. Living and Working in Germany

Living & Working in Germany. Sabine Seidler, Welcome to Germany!

Media Facts. BerlinOnline Stadtportal GmbH & Co. KG

LGI s acquisition of KBW effects on competition and bargaining power

The Concept to Measure and Compare Students Knowledge Level in Computer Science in Germany and in Hungary

Cost and Activity Accounting in the German Surveying Administrations

10,59 00,00 10,59 7,47 6,42 Many Topics 5,34 6,83

OPERATING SYSTEM SERVICES

Internationalisation and Health Care Export S. von Bandemer University of Applied Sciences, Gelsenkirchen Institute Work and Technology, Science Park

Microsoft Official Curriculum (MOC) at the COMPAREX Academy

Abstraction in Computer Science & Software Engineering: A Pedagogical Perspective

Data Mining Introduction

7üVNORD. TÜV NORD Röntgentechnik

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

Principles for Working with Big Data"

'& ##! %1# ##!!* #!!! 23!!!

Photovoltaic Systems on Roofs Athen, October 8th 2013

INDEX. Introduction Page 3. Methodology Page 4. Findings. Conclusion. Page 5. Page 10

analytics+insights for life science Descriptive to Prescriptive Accelerating Business Insights with Data Analytics a lifescale leadership brief

State of Berlin. Investor Presentation. June Page 1. Land Berlin/Thie

Escaping RGBland: Selecting Colors for Statistical Graphics

GETTING STARTED WITH R AND DATA ANALYSIS

POL 204b: Research and Methodology

Statistics for BIG data

Big Data and Data Science: Behind the Buzz Words

Computational Science and Informatics (Data Science) Programs at GMU

A future career in analytics

HELP DESK SYSTEMS. Using CaseBased Reasoning

Lessons and Insights from

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

AD COPY TEST 1-2/2016. Eye-Tracking

What Makes a Good Test?

SCRATCH Lesson Plan What is SCRATCH? Why SCRATCH?

Today's Topics. COMP 388/441: Human-Computer Interaction. simple 2D plotting. 1D techniques. Ancient plotting techniques. Data Visualization:

Ilmenau University of Technology WELCOME

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

A Visualization is Worth a Thousand Tables: How IBM Business Analytics Lets Users See Big Data

New survey recruiting strategies: Online Panel vs. Mobile Advertising

The RASFF Tasks and Responsibilities of the National Contact Point and the Federal State Contact Points

ANALYTICS A FUTURE IN ANALYTICS

Selected Facts and Figures about Long-Term Care Insurance

A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS

AcademyR Course Catalog

COMPUTING AND VISUALIZING LOG-LINEAR ANALYSIS INTERACTIVELY

Structured Application Development

VISUALIZATION. Improving the Computer Forensic Analysis Process through

REPORT OF MARRIAGE -Requirements for Registration of Marriages in Baden-Württemberg of Hessen only-

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Website Promotion for Voice Actors: How to get the Search Engines to give you Top Billing! By Jodi Krangle

Real World Application and Usage of IBM Advanced Analytics Technology

A Case for Static Analyzers in the Cloud (Position paper)

Log Analyzer Reference

Intro to scientific programming (with Python) Pietro Berkes, Brandeis University

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

imc FAMOS 6.3 visualization signal analysis data processing test reporting Comprehensive data analysis and documentation imc productive testing

A full spectrum of analytics you can get yourself

Achieving business benefits through automated software testing. By Dr. Mike Bartley, Founder and CEO, TVS

Data Analysis with MATLAB The MathWorks, Inc. 1

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila

The Assumption(s) of Normality

an introduction to VISUALIZING DATA by joel laumans

2015 Workshops for Professors

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

Designing and Evaluating a Web-Based Collaboration Application: A Case Study

imc FAMOS 6.3 visualization signal analysis data processing test reporting Comprehensive data analysis and documentation imc productive testing

Session 15 OF, Unpacking the Actuary's Technical Toolkit. Moderator: Albert Jeffrey Moore, ASA, MAAA

Active Directory Recovery: What It Is, and What It Isn t

Prediction The Evidence it Doesn't Work and a Popular Theory that Leads to Losses

The national scale: Coordination, negotiation and administration of national licences with international publishers (case study from Germany)

Data Science Certificate Program

R language in data mining techniques and statistics

Chapter 13: Program Development and Programming Languages

D is for Science. John Colvin

Better Onboarding to Enable Organizational Agility

How do most businesses analyze data?

Top 3 Ways to Use Data Science

Data Visualization Best Practice. Sophie Sparkes Data Analyst

Professional Organization Checklist for the Computer Science Curriculum Updates. Association of Computing Machinery Computing Curricula 2008

Linguistic information visualization and web services

Why Alerts Suck and Monitoring Solutions need to become Smarter

FIVE STEPS FOR DELIVERING SELF-SERVICE BUSINESS INTELLIGENCE TO EVERYONE CONTENTS

Defense Technical Information Center Compilation Part Notice

What it takes to be a World Class Industrial Engineer

CS420: Operating Systems OS Services & System Calls

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Teaching Biostatistics to Postgraduate Students in Public Health

Undergraduate Advising FAQ

Lecture 9: Data Mining, Data Analytics and Big Data

Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.

Interreg CENTRAL EUROPE Programme Announcement of the first call for proposals step 1

The Prophecy-Prototype of Prediction modeling tool

StARScope: A Web-based SAS Prototype for Clinical Data Visualization

Dual Training at a Glance. Last update: 2007

Transcription:

Interactive Graphics for Statistics: Principles and Examples Augsburg, May 31., 2006 Department of Computational Statistics and Data Analysis, Augsburg University, Germany Statistical Computing / Computational Statistics

Computational Statistics / Statistical Computing In Short 2 Statistical Computing / Computational Statistics S C / C S 1 / 1 Portugal : Nederlands

Computational Statistics / Statistical Computing Definition (for now) Statistical Computing 3 Providing the computational tools for statistics by using tools and techniques from computer science Computational Statistics Doing statistics by using computer based tools

Computational Statistics / Statistical Computing Statistics & Computing: Past Started as Offline Computing 4 Interaction was often limited to the operator Stat Computing = Software Development Stat Graphics a step back for a long time then BLNFYs most system still inherit their graphics system from pen plotter times Many heterogeneous libraries Stat Computing = Computational Statistics

Computational Statistics / Statistical Computing 5 What gained Statistics from Computing? What can be done without a computer? Tests Models Graphics What can be done better with a computer? Tests, Models, Graphics Tests, Models, Graphics for larger data What can not be done without a computer? Advanced Models Simulations Interactive Graphics and Computing Statistical learning algorithms

Computational Statistics / Statistical Computing 6 (Mathematical) Statistics: Two Views An application of mathematics distribution based small to midsize samples emphasis on significance fitting data to models computing helps to turn models into software Supporting Data Analysis data based complex data rather than samples emphasis should be on relevance fitting models to data computing helps to generate hypotheses

Computational Statistics / Statistical Computing 7 Stat Learning & Computer Science Why computer scientists steal the show from statisticians better marketing database savvy What is Data Mining? Pregibon: Statistics at scale and speed Ripley: Machine learning is statistics minus any checking of assumptions Data Mining is machine learning with better marketing Is Stat Computing shifting away from Statistics?

Computational Statistics / Statistical Computing Statistics & Computing: Today 8 Everybody does it with/for R (true only for stat computing) Computational Statistics still done with SAS, SPSS,? Fewer and fewer (standalone) software projects in statistical research. Is this good or bad? Computational Statistics 100% Statistical Computing ε, ε small; how small? Impact of research on commercial software is still rather small.

Computational Statistics / Statistical Computing 9 Who does Statistical Computing? Past: real programmers Today: Statisticians Domain Experts Computer Scientists more and more non-programmers Future: Is writing an R function statistical computing? Is writing an R package statistical computing? Is writing an R successor statistical computing?

Computational Statistics / Statistical Computing 10 Software for Data Mining: Web Survey Data Sources Data Preparation

Computational Statistics / Statistical Computing 11 Software for Data Mining: Web Survey (cont.) Techniques Tools

Statistics and Computers Efron (2001): Statistics in the 20th century Applications 1900 Theory? Methodology 1950 2000 Mathematics Computation

Persistent Communication

Persistent Communication

Persistent Communication

Persistent Communication

Persistent Communication

What is statistical software? A program, which takes numbers as input and creates tables (and figures)? A (collection of) program(s) for exploration, inference and modelling? A tool for administration, manipulation and analysis of data? A medium of communication with CPU (graphics card, printer,... ) of our computer?

Communication forms Pictoral languages: Easy to learn, often universally understood, limited in complexity and expressions. Examples: Austrian road signs, Apple-GUI, etc. Written languages: Hard to learn, need choice of language, almost unlimited in complexity and expressions. Examples: Austrian German, UNIX-Shell, etc. Iliad and Odyssey by Homer have survived until today. The mythology of ancient Egypt may have been as complex as the one of ancient Greece, but try the exercise of encoding the first chapter of the Iliad in icons without loss of information! OK, writing it as a shell script isn t that much easier...... unless the script is well documented!

Relevance for data analysis Statistical software is a tool to tell the computer how to analyse your data. For simple analyses simple forms of communication are sufficient, but even the most complex GUI has only a finite number of submenus. Only programmable environments are (almost) unlimited. S has won the oscar for software systems (ACM award), S S and S S are still waiting for it.

Relevance for data analysis R has become the probably most popular tool for programming with data within academic statistics over the last years. Using a common language (Latin, English) has always been a key to success for science to avoid reinventing the wheel. This is the first time that we (as a community and as a profession) own the computational environment we use for more and more of our data analyses. Publishing data and R code makes statistical research truly reproducible.

Ex: German elections 2005 CSU Bayern SPD Bayern 2005 0.2 0.3 0.4 0.5 0.6 2005 0.2 0.3 0.4 0.5 0.6 0.45 0.50 0.55 0.60 0.65 0.70 2002 0.20 0.25 0.30 0.35 2002

Ex: German elections 2005 CDU/CSU Deutschland SPD Deutschland 2005 0.1 0.2 0.3 0.4 0.5 0.6 2005 0.1 0.2 0.3 0.4 0.5 0.6 0.2 0.3 0.4 0.5 0.6 0.7 2002 0.2 0.3 0.4 0.5 0.6 2002

Ex: Mixture model for SPD 0.2 0.3 0.4 0.5 0.6 0.2 0.3 0.4 0.5 SPD Deutschland 2002 2005

Ex: Mixture model for SPD > table(btw2005$bundesland, cluster(f1)) 1 2 Baden-Württemberg 0 37 Bayern 0 45 Berlin 0 12 Brandenburg 10 0 Bremen 0 2 Hamburg 0 6 Hessen 0 21 Mecklenburg-Vorpommern 7 0 Niedersachsen 0 29 Nordrhein-Westfalen 0 64 Rheinland-Pfalz 0 15 Saarland 4 0 Sachsen 16 1 Sachsen-Anhalt 10 0 Schleswig-Holstein 0 11 Thüringen 9 0

R only??? OK, until now I have done the expected to create a (hopefully) strong case advocating the usage of R. Who would have problems turning the following (spoken) directions into proper R code? zett ist gleich l m klammer auf ypsilon tilde x beistrich data ist gleich d klammer zu Several of my students in Munich had and it took me quite some time to figure out why.

Want some sweets? I guess most people will agree that the existening differences of laguages, cultures, etc. (E-DOLCE) makes the world a much more interesting place to live in. E-DOLCE wastes resources, but is also a source of inspiration. (A dash of) Breiman: Machine learning has reinvented a lot of things that statisticians already knew, but it also provided many new ideas which have subsequently been adopted by us. There are many good reasons, why many road signs feature icons rather than words.

Challenges for the future It took mankind quite some time to develop the languages we speak and write today, it would be naive to assume that we already have the definite language for statistical data analysis. We need better (and most likely radically different) ways to specify large and complex models deal withy large data sets make our tools safely available to practitioners YOUR INPUT

Computational Statistics / Statistical Computing 13 Statistical Graphics: Two Worlds? Stat graphics is widely used; but mainly for diagnostics Most diagnostic plots do not really visualize the model, but residuals or other model characteristics Understanding diagnostic plots is usually more complex than understanding plots of the raw data Chart Junk! There are as many stupid graphics out there as stupid models, With graphics one would hope to find out more easily (e.g. check the graphics section of The American Statistician regularly) The data visualization community often show us how not to do it

Computational Statistics / Statistical Computing Graphic for Data Cleaning 14 Data cleaning and preparation takes most of the time for most of the projects Hardly taught in any statistical curricula Finding the round tag in the square hole is far easier with graphics than with any other tool Undiscovered error and artifacts in the data often end up as the signal in the data, or at least will obscure the real thing What are standard tools and procedures for data cleaning? (or do we mostly rely on common sense here?)

Computational Statistics / Statistical Computing Graphics for Data Analysis is not based on formal theory but has proven to be very effective 15 is not taught in standard curricula but used by most statisticians once they left college is not easy to teach a case study based approach is usually most successful is supporting an interactive and iterative exploration of data thus needs interactive software tools is not well supported by most statistics software but there is (at least some) software that helps

Computational Statistics / Statistical Computing 16 Interactive Computing in Statistics Command line input with (almost) instantaneous response was a big advance Interaction with command line is limited Using graphics, new ideas are usually faster generated than you can type the next error-free command Direct manipulation of statistical objects (be it data, models or graphics) speed things up drastically, but that has its limits, too Multiple views of statistical objects are crucial (although we might work sequentially, having results in parallel will better support cognitive reasoning)

Computational Statistics / Statistical Computing On the Interface 17 Graphical user interfaces (GUIs) started the Desktop Revolution As with any revolution, things fail, if the initial impetus is lost (There are GUI guidelines and books from D.A. Norman and J. Nielsen are fun) There was a great momentum in stat computing 20 years ago to leverage this technological advance Most software projects either died or stalled (DataDesk is still great, but essentially the same software since 1994) What about R s user interface? Too expensive & and not rewarded (or do you know someone who got tenure for great software?)

Computational Statistics / Statistical Computing 18 What makes Graphics interactive? Direct manipulation is at the core (classical computational statistics was 90% algorithm and 10% interface Interactive Statistical Graphics is 10% algorithm and 90% interface!) Selection of subgroups, i.e., conditioning Modification of plot parameters What makes a graphics interactive? data can be selected (selection state is actually only one possible attribute of the data) support for highlighting objects can be queried Linking Attributes can be linked across multiple views (graphics, summaries, models, )

DEMO Computational Statistics / Statistical Computing 19

204 353 Computational Statistics / Statistical Computing Enhancing Graphics 20 Add results from numerical methods from classical statistics to graphical methods (or vice versa). Idea: Make visible what is not immediately visible or quantifiable. Examples log-linear models Low High Hard Medium Soft Hard Medium Soft Low High 55 176 Scatterplot smoothers

Computational Statistics / Statistical Computing Main Questions 21 Statistical Computing and Computational Statistics started as one; what is the state today? What is the relation of computations and graphics in statistics; are they both computing? Does statistical computing leverage current technologies from computer science? Is the gap between users and developers increasing or decreasing? Which side are you on; or where do you want to be next week?