Using Statistical data formats in visualization
Background Statistics explorer: Generic statistics visualization
Background
Our focus is on visualization, but visualization is useless without data, and data is useless without an easy way to load it.
Background
[Figure: screenshot with panels labelled Data Providers, Loaded Indicators and Selected Indicators]
Background - Data loading demo
- Start off on a bright note
- Download PC-Axis from SCB
- Load directly into Statistics explorer or Mdim explorer
- http://www.scb.se/pages/listwide 259087.aspx
- http://www.ssd.scb.se/databaser/makro/visavar.asp?yp=duwird&xu=c5587001&lang=1&langdb=1&fromwhere=s&omradekod=be&huvudtabell=befolkningny&innehall=folkmangd&prodid=be0101&deltabell=k2&fromsok=&preskat=o
Background
To make our tool useful, it needs to:
- Support the most common formats
- Combine data from different sources
- Load data in an intuitive way
  - It should be easy to understand WHY data is loaded in a specific way
- Tell the user what is wrong with their data
Formats
- Generic formats: Excel, txt, CSV
- Statistics formats: PC-Axis, SDMX
Generic Formats
- Users are guided to use our structure
- Simpler to have special additions like categorical data and groupings
- Proper error management and feedback goes a long way
  - Make sure the user knows what is wrong
- Limits the user to supported structures
  - Their export format either needs specific support OR they need to edit their files
- Problematic to keep track of and update data
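As an illustration of "make sure the user knows what is wrong", here is a minimal sketch of cell-level feedback for a CSV file. The function name, the column names and the sample values are all made up for this example; it is not how Statistics explorer reports errors.

```python
import csv
import io

def find_value_errors(csv_text, numeric_columns):
    """Return readable messages for cells that should be numeric but are not."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for column in numeric_columns:
            cell = (row.get(column) or "").strip()
            try:
                float(cell)
            except ValueError:
                errors.append(
                    f"Row {row_number}, column '{column}': "
                    f"expected a number, found '{cell}'"
                )
    return errors

# Made-up sample data, only to show the shape of the feedback.
sample = "region,population\nNorrköping,100\nLinköping,n/a\n"
for message in find_value_errors(sample, ["population"]):
    print(message)
```

Pointing at the exact row and column lets the user fix the file without guessing which structure the tool expected.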
Excel: Categorical Example
[Figures: Excel sheets with columns marked Categorical and Numerical, followed by Treemap and Color Map slides]
Statistics Formats
- Strictly structured
- Has identifiable properties that can be used by our tools
  - Dimensions
  - Values
  - Time
  - Metadata
Statistics Formats
- Exported data can be used directly in tools that support the format
- No need to edit files or change databases, as long as the databases support proper export mechanisms
- Potentially much simpler to update and manage the tool's data
Common issues - Notation
Contents: Spatial
- Countries, regions
- Extra important if the tool uses a map
- Identified in different ways depending on the publisher, language and data set
  - region, country, geo, cou, location etc.
- Usage of codes and/or names differs as well
  - ISO 2/3, local code systems, only names
Common issues - Notation
Contents: Spatial
- Need to prompt the user to identify the spatial dimension
- [Figure: PC-Axis prompt in Statistics explorer, reading a Finnish-language PC-Axis file]
- [Figure: SDMX load interface in Statistics explorer, with loading fields for both files along with the location identifier]
Common issues - Notation
Contents: Spatial
- Problems do exist for other formats as well, but there are fewer options
- [Figure: prompt when reading an Excel file with data on both sheets and columns, where they couldn't be correctly identified]
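The spatial-dimension problem can be sketched as a lookup against known aliases, falling back to a user prompt when nothing matches. The alias set below only contains the names mentioned on the slides plus their plurals, and the function name is hypothetical.

```python
# Hypothetical alias list, based on the names mentioned above.
SPATIAL_ALIASES = {"region", "regions", "country", "countries",
                   "geo", "cou", "location"}

def guess_spatial_dimension(dimension_names):
    """Return the first dimension whose name looks spatial, or None.

    None means the tool cannot decide and should prompt the user.
    """
    for name in dimension_names:
        if name.strip().lower() in SPATIAL_ALIASES:
            return name
    return None

print(guess_spatial_dimension(["Gender", "GEO", "time"]))       # GEO
# Finnish dimension names (gender, region, year) are not in the alias list,
# so the user would have to be asked.
print(guess_spatial_dimension(["sukupuoli", "alue", "vuosi"]))  # None
```

An alias list can never cover every publisher and language, which is why the prompt remains necessary.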
Common issues - Notation
Contents: Time
- 2012-05-31
- 05-31-2012
- Q2-2012
- 2012-Q2
- January, February
- Etc.
- Our tools currently don't care; they only assume the values can be sorted alphabetically
- Plans to use proper date standards exist, but there are many localization issues
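To show why alphabetical sorting is fragile for these notations, here is a minimal sketch that maps each label to a sortable (year, month, day) key. It assumes that `05-31-2012` means month-day-year and that month names are in English; both are exactly the kind of localization assumptions the slide warns about. The function name is hypothetical.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def time_sort_key(label):
    """Map a time label to a (year, month, day) tuple that sorts correctly."""
    text = label.strip()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)   # 2012-05-31
    if m:
        return (int(m[1]), int(m[2]), int(m[3]))
    m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{4})", text)   # 05-31-2012, assumed MM-DD-YYYY
    if m:
        return (int(m[3]), int(m[1]), int(m[2]))
    m = re.fullmatch(r"Q([1-4])-(\d{4})", text)          # Q2-2012
    if m:
        return (int(m[2]), 3 * int(m[1]) - 2, 1)         # first month of the quarter
    m = re.fullmatch(r"(\d{4})-Q([1-4])", text)          # 2012-Q2
    if m:
        return (int(m[1]), 3 * int(m[2]) - 2, 1)
    if text.lower() in MONTHS:                           # month name without a year
        return (0, MONTHS.index(text.lower()) + 1, 1)
    raise ValueError(f"Unrecognised time label: {label!r}")

quarters = ["Q1-2013", "Q4-2012", "Q2-2012"]
print(sorted(quarters))                     # ['Q1-2013', 'Q2-2012', 'Q4-2012']
print(sorted(quarters, key=time_sort_key))  # ['Q2-2012', 'Q4-2012', 'Q1-2013']

months = ["January", "February", "March"]
print(sorted(months))                       # ['February', 'January', 'March']
print(sorted(months, key=time_sort_key))    # ['January', 'February', 'March']
```

Only `2012-05-31` and `2012-Q2` style labels happen to sort correctly as plain strings; the other notations need an explicit key like this one.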
Common issues - Notation
Contents: Dimensions
- Any number of value dimensions
  - Gender: Men, Women
  - Population: Age 0-14, Age 15-64, Age 65+
- Title and Description fields
- How should these be combined in the application?
Common issues - Notation - Example
How the structure of PC-Axis is used in explorer:
- TITLE: Title of the file
- CONTENTS: Contents of the file
- STUB: dimensions
- HEADING: dimensions
- VALUES: The values of each dimension
- DESCRIPTION: Description of the file
Common issues - Notation - Example
Example
- TITLE: Population numbers by gender
- CONTENTS: Population
- STUB: regions
- HEADING: gender, time
- VALUES( gender )= Men, Women
- VALUES( time )= 2000, 2001, 2002
- VALUES( region )= Norrköping, Linköping
The names of the indicators would be: "Population, Men" and "Population, Women"
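The naming rule in this example can be sketched as: take CONTENTS and append every combination of values from the dimensions that are neither spatial nor time. This is an assumption drawn from the example above, not a description of the explorer's actual code, and the function name is hypothetical.

```python
from itertools import product

def indicator_names(contents, values, spatial, time):
    """Combine CONTENTS with each value combination of the remaining dimensions."""
    # Dimensions other than the spatial and time ones define the indicators.
    other = [dim for dim in values if dim not in (spatial, time)]
    combos = product(*(values[dim] for dim in other))
    return [", ".join([contents, *combo]) for combo in combos]

values = {
    "region": ["Norrköping", "Linköping"],
    "gender": ["Men", "Women"],
    "time": ["2000", "2001", "2002"],
}
names = indicator_names("Population", values, spatial="region", time="time")
print(names)  # ['Population, Men', 'Population, Women']
```

With more than one remaining dimension the number of indicators grows as the product of their value counts, which is one reason the question of how to combine dimensions matters.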
Common issues - Notation - Example
Example from SCB
- TITLE: Statistics focused on sick leave numbers by region, time and value
- CONTENTS: Statistics focused on sick leave
- STUB: regions, variables
- HEADING: time, indicators
- VALUES( variables )= Total, Men, Women
- VALUES( indicators )= "Sick leave, days", "Percentage who contributes to sick leave, per cent"
The names of the indicators would look like: "Total, Sick leave, days, Statistics focused on sick leave"
Common issues - Notation - Example
- Leaves work for the user, who must make sure their file has a structure that fits what we do
- Being more flexible in the tool could help, but would make it more complex to read data
Common issues
- Usage of special characters: ( ) ;
- All cases have to be correctly identified
- Quite possible and simple, but time consuming
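One of these cases can be sketched as follows: splitting a value list where the separators may also appear inside a value. The sketch assumes values are wrapped in double quotes, separated by commas, and the statement ends with a semicolon; the function name is hypothetical.

```python
def split_quoted_list(text):
    """Split a value list on commas that are outside double quotes.

    Stops at the first semicolon outside quotes (the statement terminator).
    """
    items, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes      # enter or leave a quoted value
        elif ch == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        elif ch == ";" and not in_quotes:
            break
        else:
            current.append(ch)             # includes , ; ( ) inside quotes
    items.append("".join(current).strip())
    return items

line = '"Sick leave, days","Percentage who contributes to sick leave, per cent";'
print(split_quoted_list(line))
# ['Sick leave, days', 'Percentage who contributes to sick leave, per cent']
```

A plain `line.split(",")` would produce four fragments here instead of two values, which is the kind of error that has to be ruled out case by case.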
SDMX
Our tools can read:
- SDMX-ML: XML-based format
- It needs two files:
  - DSD: Data structure definition
  - Data
- Location/regional dimension has to be identified
- We use an open source project: flex-cb, previously developed by ECB
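To illustrate why the structure file is needed before the data file can be interpreted, here is a minimal sketch that lists the dimension ids found in a structure definition. The XML below is a made-up, heavily simplified stand-in: real SDMX-ML files use namespaces and many more elements, and this is not how flex-cb works. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a DSD (not valid SDMX-ML).
DSD_XML = """
<Structure>
  <Dimension id="FREQ"/>
  <Dimension id="REF_AREA"/>
  <TimeDimension id="TIME_PERIOD"/>
</Structure>
"""

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def dimension_ids(xml_text):
    """Return the ids of all Dimension elements, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    return [el.get("id") for el in root.iter()
            if local_name(el.tag) == "Dimension"]

print(dimension_ids(DSD_XML))  # ['FREQ', 'REF_AREA']
```

The resulting list is what the user would choose from when asked which dimension is the location.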
SDMX
OECD: DotStat integration
- explorer component viewer: single-view app
- Integrated into the database
- Allows direct viewing of data in our graphs
- Flow: user selects data -> query URL -> OECD web service -> SDMX data
SDMX
Testing with SCB and Eurostat
- Evaluating usage of SDMX
  - For regular users?
  - What kind of files are suitable?
  - Usually very large files, for database communication
- Finding bugs
  - No two SDMX implementations seem to be the same
  - Both in our reader and the export functionality
SDMX
- Often completely irrelevant to the normal user
- Extremely powerful for technical users
- Hard to use, but better tools will solve this
Web services
- Best way of acquiring data for normal users
- Format is irrelevant, black-box approach
- Example: World databank
Web services
- Standards?
- World databank uses its own API and data format
Wrapping up
- Most common format is Excel
  - Statisticians don't want a black box format
  - Harder to detect errors in files
- PC-Axis is used by a certain group of people
  - They are usually experienced with PC-Axis editing
- SDMX is only used by technical experts
  - Used for data export and web services
  - Quite heavily promoted
  - From our point of view it's hard to know the focus of it
  - Mostly used for large files, transferred between databases
Wrapping up
- Need more structure?
  - Not at all! A flexible system will always be better
- Guidelines are important
  - Usage of codes and structures
- Know your audience
  - Make sure they have options on data structure, and that it is clear how to reach them