Second International Workshop on Preservation of Evolving Big Data - Panel on Big Data Quality



Similar documents
Big Data, Integration and Governance: Ask the Experts

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

ESTABLISHING A MEASUREMENT PROGRAM

No Data Governance, No Actionable Insights

IBM Analytics Make sense of your data

Collaborations between Official Statistics and Academia in the Era of Big Data

South East of Process Main Building / 1F. North East of Process Main Building / 1F. At 14:05 April 16, Sample not collected

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Big Data Readiness. A QuantUniversity Whitepaper. 5 things to know before embarking on your first Big Data project

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

The New World of Data. Don Strickland President, Strickland & Associates

Veracity in Big Data Reliability of Routes

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Introduction to Data Mining

Indian Journal of Science The International Journal for Science ISSN EISSN Discovery Publication. All Rights Reserved

The Next Wave of Data Management. Is Big Data The New Normal?

International Journal of Advancements in Research & Technology, Volume 3, Issue 5, May ISSN BIG DATA: A New Technology

Test Data Management in the New Era of Computing

Data collection architecture for Big Data

White Paper. An Overview of the Kalido Data Governance Director Operationalizing Data Governance Programs Through Data Policy Management

Formal Methods for Preserving Privacy for Big Data Extraction Software

IEEE JAVA Project 2012

Information Lifecycle Governance. Surabhi Kapoor & Jan Lambrechts

A Strategic Approach to Unlock the Opportunities from Big Data

Big Data, Big Risk, Big Rewards. Hussein Syed

Information Stewardship: Moving From Big Data to Big Value

Anuradha Bhatia, Faculty, Computer Technology Department, Mumbai, India

Associate Prof. Dr. Victor Onomza Waziri

Big Data - Security and Privacy

Procedia Computer Science 00 (2012) Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

GenBank, Entrez, & FASTA

A Bayesian Approach for on-line max auditing of Dynamic Statistical Databases

A Contact Center Crystal Ball:

Data Cleansing for Remote Battery System Monitoring

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

COMP9321 Web Application Engineering

1. Understanding Big Data

Information Management course

Safe robot motion planning in dynamic, uncertain environments

PICTURE Project Final Event. 21 May 2014 Minsk, Belarus

DGE /DG Connect

CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing. University of Florida, CISE Department Prof.

Generating the Business Value of Big Data:

Example Gantt Chart. PERT charts...) (Later, but next, Source: oc/images/gantt.

Big Data Analytics Hadoop and Spark

German Record Linkage Center

Big Data in Transportation Engineering

Big Data and Analytics: Challenges and Opportunities

The Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics

High-Frequency Active Internet Topology Mapping

Remote Service. SASG - Big Data From machine design to IT management & Remote Service. Marcel Boosten Philips Healthcare October 7, 2014

Software Development for Medical Devices

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

Introduction to the Mathematics of Big Data. Philippe B. Laval

Overview. Software engineering and the design process for interactive systems. Standards and guidelines as design rules

Introduction to Data Mining

Big Data & Analytics for Semiconductor Manufacturing

Privacy Challenges of Telco Big Data

International Journal of Innovative Research in Computer and Communication Engineering

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Call topics. September SAF RA joint call on Human and organizational factors including the value of industrial safety

The Big Deal about Big Data. Mike Skinner, CPA CISA CITP HORNE LLP

Host Fingerprinting and Tracking on the Web: Privacy and Security Implications

Ernestina Menasalvas Universidad Politécnica de Madrid

Data Governance, Data Architecture, and Metadata Essentials Enabling Data Reuse Across the Enterprise

Session 4 Cloud computing for future ICT Knowledge platforms

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

White Paper: Assessing Performance & Response Time Requirements

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Ali Eghlima Ph.D Director of Bioinformatics. A Bioinformatics Research & Consulting Group

ICT Perspectives on Big Data: Well Sorted Materials

Data Centric Computing Revisited

Introducing Formal Methods. Software Engineering and Formal Methods

Job Description SF06740

Integrated Risk Management:

Introduction. A. Bellaachia Page: 1

NSF Workshop on Big Data Security and Privacy

SMTP: Stedelijk Museum Text Mining Project

Transcription:

Second International Workshop on Preservation of Evolving Big Data - Panel on Big Data Quality Angela Bonifati University of Lyon 1 Liris CNRS, France March 15, 2016 Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 1 / 8

Table of contents 1 Four Questions Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 2 / 8

Q 1 : Risk assessment of poor data quality Poor data entails poor data analysis In our ongoing research project (MedClean a ), we work on medical data involving patient data, DNA sequencing data and medical images (as issued by microscopes), that may exhibit inconsistency, incompleteness and noise. Consequently, the diagnoses built on top of them might be affected. Very often, the data cannot be entirely cleaned due to many factors: the absence of authoritative sources, the inherent format of data, privacy issues. Can we characterize the uncertainty of such data and thus express a confidence measure of the subsequent analysis? a Défi CNRS Mastodons 2016, MedClean (PI: AngelaBonifati) Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 3 / 8

Q 2 :What are the main quality factors of Big Data? Quality depends on the format Because of one of the Vs of Big Data, namely Variety, many diverse data formats need to co-exist, each of which requires a specific notion of quality. Relational tuples, Graphs, Time series, Images if we restrict to relational tuples, we can identify measures/conceive techniques to ensure data quality (research is somehow mature); if we go beyond the relational models, the setting becomes less clear: for graphs, for instance, do we have a notion of quality? for images, is the quality related to its resolution and its precision? what about time series or DNS sequencing data? Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 4 / 8

Q 3 : Challenges for Big Data Quality: the FOUR Vs Volume: scale factor of data involved in the cleaning processes drastically changes (e.g. images of the order of Terabytes are too voluminous to be manipulated by physicians/biologists, difficult to browse etc.) Velocity: time series from clinical data are an example of data in which velocity is rather critical (of the order of Petabytes...); Variety: already mentioned above; an important aspect of our project is about annotation and transformation across different formats. Veracity: trust is very important in health-care decisions and very difficult to guarantee on massive datasets. Can we isolate query-driven snippets on datasets for which we can provide guarantees of quality and trust? Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 5 / 8

Q 4 : Quality Assessment and Decision Making (Part I) As mentioned before, sometimes we are unable to perform data cleaning at best for various reasons. Can we use (probabilistic?) approaches to measure the uncertainty of data so that we can also quantify the uncertainty (and the cost) of decision making processes? Even when data cleaning attains high values of precision and recall, we may need to quantify the uncertainty for instance because the involved data will be part of further processing and integration in the sequel and we want to keep track of its incompleteness. Finally, for privacy reasons, we may be in a situation in which we cannot clean the original data sources. We need to come up with assessment and decision making methods that work in a privacy-preserving manner. Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 6 / 8

Q 4 : Can Data Quality be ignored? (Part II) Which threshold depends on the actual data formats and on the subsequent tasks of the lifecycle, as well as on the final analysis that needs to be performed. Maybe in some cases, it can but still we need to annotate data with suitable indicators. At what extent can data quality be ignored? We actually do not know... Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 7 / 8

Conclusion Big Data Quality Data quality: design of quality-conscious methods. Interactions between the Vs: several open problems out there! State-of-the-art and directions of research Existing large-scale data cleaning methods for relational databases, entity resolution for graphs... Combinations of data formats: are we ready? Angela Bonifati 2nd Diachron Workshop 2016 March 15, 2016 8 / 8