1 Data Wrangling: The Elephant in the Room of Big Data. Norman Paton, University of Manchester
2 Data Wrangling. Definitions: "a process of iterative data exploration and transformation that enables analysis"; "the process of manually converting or mapping data from one 'raw' form into another format that allows for more convenient consumption of the data with the help of semi-automated tools". S. Kandel, et al., Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data, Information Visualization, 10(4), 2011.
3 Extract, Transform and Load - 1 Of course, this is not completely new, and Extract, Transform and Load (ETL) tools have been around for a significant time. ETL tools support source wrapping, warehouse population, workflow languages, etc. ETL vendors also have big data offerings.
4 Extract, Transform and Load - 2 ETL tools are clearly useful, with products from database vendors and data integration companies. ETL tools emerged to support data warehousing, and thus typically have roots in enterprise settings. ETL tools typically involve significant manual effort. ETL costs no doubt vary widely from project to project, but are quoted as representing up to 80% of the development time in warehousing projects. S. Kandel, et al., Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data, Information Visualization, 10(4), 2011.
5 Big Data: does it make a difference? Big data is sometimes characterised by the 4 Vs: Volume (the scale of the data), Velocity (speed of change), Variety (different forms of data), and Veracity (uncertainty of data). So size matters, but it isn't everything. Data wrangling for big data must address all four Vs at the same time. Classical, substantially manual ETL may struggle with numerous and rapidly changing sources.
6 The Business Case - 1 There is strong support for big data being commercially important. The International Institute of Analytics estimates the Big Data market at $16.1B in 2014, growing 6 times faster than the overall IT market; the projection for 2017 is ~$50B. Gartner (2014) estimates the Data Integration tool market at over $2.2B at the end of 2013, predicted to rise to ~$3.6B. Gartner (2014) estimates the Data Quality market at $960M in software revenue at the end of 2012, predicted to rise to $2B by 2017.
7 The Business Case - 2 ...but many of the potential beneficiaries of big data cannot simply throw resource at data wrangling. The government's Information Economy Strategy states: "the overwhelming majority of information economy businesses (95% of the 120,000 enterprises in the sector) employ fewer than 10 people."
8 Case Study: e-commerce If you run an e-commerce site, then you need to be able to understand pricing trends among your competitors. This may involve getting to grips with: Volume: thousands of sites; Velocity: sites, site descriptions and contents changing; Variety: in format, content, user community, etc.; and Veracity: unavailability, inconsistent descriptions, etc. Manual attempts at data wrangling are likely to be expensive, partial, unreliable and poorly targeted.
9 Data Wrangling Research So data wrangling is a research challenge, currently without a community or established priorities. The VADA (Value Added DAta Systems) project seeks to define principles and solutions for adding value to data, supporting users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions. VADA takes account of: the user context (requirements such as the trade-off between completeness and correctness); and the data context (availability, cost, provenance, quality).
10 User Context: e-commerce The same application may involve different user contexts. For example: Price comparison may normally be able to work with a subset of high-quality sources, but Issue investigation, say where sales of a popular item have been falling, may require a more complete picture, at the risk of obtaining more incorrect data. As a result, hard-wiring data wrangling tasks risks producing data sets that are not always fit for purpose, where the reason for this is implicit.
11 VADA Components VADA seeks to support wrangling by integrating: Data extraction, Data integration, Quality analysis and Querying in a best-effort, pay-as-you-go approach to data wrangling.
12 Example: Data Integration How do we avoid lots of slow, expensive expert input into data integration? In pay-as-you-go data integration, alternative ways of combining data from sources can be generated algorithmically. Automatically generated candidate integrations can be refined in the light of feedback, for example from users or crowds. Decision support techniques can be used to capture the user's requirements (e.g. in terms of quality or cost), in ways that inform which integrations are generated.
13 Example: Mapping Selection - 1 Problem statement: Given a set of candidate mappings, and feedback on their results, identify the subset that best meets the user's requirements in terms of precision and recall. Associated definitions: Precision: the fraction of the retrieved results that are correct. Recall: the fraction of the correct results that are retrieved. The following were among the mappings generated by a commercial schema mapping tool for populating a table with schema <name, country, province>:
M1 = SELECT name, country, province FROM Mondial.city
M2 = SELECT city, country, province FROM Mondial.located
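As a worked illustration, the two definitions above can be sketched directly in code. This is an assumed toy example, not the VADA implementation; the tuple values are invented rather than drawn from the Mondial data.

```python
# Precision and recall of a mapping's results, as defined on this slide.
# `retrieved` is the set of tuples the mapping returned; `correct` is the
# set of tuples known (e.g. from feedback) to be correct.

def precision(retrieved, correct):
    """Fraction of the retrieved results that are correct."""
    if not retrieved:
        return None  # nothing retrieved, precision undefined
    return len(retrieved & correct) / len(retrieved)

def recall(retrieved, correct):
    """Fraction of the correct results that are retrieved."""
    if not correct:
        return None  # no correct results known, recall undefined
    return len(retrieved & correct) / len(correct)

# Illustrative feedback for a hypothetical mapping:
retrieved = {"Paris", "Lyon", "Narnia"}          # one spurious tuple
correct = {"Paris", "Lyon", "Marseille", "Lille"}

print(precision(retrieved, correct))  # 2 of 3 retrieved tuples are correct
print(recall(retrieved, correct))     # 2 of 4 correct tuples were retrieved
```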
14 Example: Mapping Selection - 2 We can estimate the quality of the generated mappings using feedback. How much feedback do we need? The results presented report the precision and the recall obtained for a given precision threshold, for different amounts of feedback. Khalid Belhajjame, et al., Incrementally Improving Dataspaces Based on User Feedback, Inf. Syst., 38(5), 2013.
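One plausible reading of the selection step can be sketched as a search over subsets of mappings: keep the combination whose estimated overall precision stays above the user's threshold while retrieving as many correct results as possible. This is an assumed illustration only; the mapping names, precision estimates and result counts below are invented, not taken from the paper.

```python
# A brute-force sketch of threshold-based mapping selection.
from itertools import combinations

mappings = {
    # name: (estimated precision from feedback, number of results returned)
    "M1": (0.9, 100),
    "M2": (0.4, 200),
    "M3": (0.7, 150),
}

def select(mappings, threshold):
    """Pick the subset maximising expected correct results, subject to
    the combined (size-weighted) precision meeting the threshold."""
    best, best_correct = [], 0.0
    names = list(mappings)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            total = sum(mappings[m][1] for m in subset)
            expected_correct = sum(p * n for p, n in
                                   (mappings[m] for m in subset))
            if total and expected_correct / total >= threshold \
                    and expected_correct > best_correct:
                best, best_correct = list(subset), expected_correct
    return best

print(select(mappings, 0.7))  # ['M1', 'M3'] under these estimates
print(select(mappings, 0.9))  # ['M1']: only M1 meets the stricter threshold
```

Relaxing the threshold admits more mappings and hence more results (higher recall) at the cost of precision, which is exactly the trade-off the reported results explore.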
15 Evaluate Result: Precision Threshold [graph: precision obtained for a given precision threshold, for different amounts of feedback]
16 Evaluate Result: Precision Threshold [graph: recall obtained for a given precision threshold, for different amounts of feedback]
17 VADA: All Change The component technologies for extraction, integration and cleaning must themselves: provide automated analyses that are informed by and take account of the user context; share information with each other, so that, e.g., integration can identify issues with extraction; and make well-informed decisions that use all available evidence about the data context, such as reference data sets and ontologies. Thus making data wrangling more cost-effective and systematic involves a fundamental rethink across a wide front.
18 Conclusions Data wrangling is a problem and an opportunity: A problem because the 4 Vs of big data may all be present together a lot of the time, undermining manual approaches. An opportunity because, if we can make data wrangling much more cost-effective, all sorts of hitherto impractical tasks come into reach. Call to arms: we will have a serious go at this in VADA, but there is much to do, and there are surely different viable approaches to be taken.
19 Acknowledgements VADA is funded by: The Engineering and Physical Sciences Research Council Through a grant to: Georg Gottlob, Thomas Lukasiewicz, Dan Olteanu, Giorgio Orsi, Tim Furche In cooperation with: Norman Paton, Alvaro Fernandes, John Keane Leonid Libkin, Wenfei Fan, Peter Buneman, Sebastian Maneth