Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico



Similar documents
Information Management course

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Total Survey Error: Adapting the Paradigm for Big Data. Paul Biemer RTI International University of North Carolina

Bayesian networks - Time-series models - Apache Spark & Scala

Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the school year.

MATHEMATICAL METHODS OF STATISTICS

Java Modules for Time Series Analysis

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Statistics Graduate Courses

PS 271B: Quantitative Methods II. Lecture Notes

The Basics of Graphical Models

The Optimality of Naive Bayes

1 Short Introduction to Time Series

A Statistical Framework for Operational Infrasound Monitoring

NEUROMATHEMATICS: DEVELOPMENT TENDENCIES. 1. Which tasks are adequate of neurocomputers?

NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS

Principles of Data Mining by Hand&Mannila&Smyth

BIG DATA What it is and how to use?

Statistical Models in Data Mining

Sanjeev Kumar. contribute

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

Managing Incompleteness, Complexity and Scale in Big Data

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

South Carolina College- and Career-Ready (SCCCR) Pre-Calculus

5 Systems of Equations

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CALIBRATION OF A ROBUST 2 DOF PATH MONITORING TOOL FOR INDUSTRIAL ROBOTS AND MACHINE TOOLS BASED ON PARALLEL KINEMATICS

Advanced In-Database Analytics

Reduced echelon form: Add the following conditions to conditions 1, 2, and 3 above:

03 The full syllabus. 03 The full syllabus continued. For more information visit PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS

Deliverable D7.2: Dissemination Plan

THE DYING FIBONACCI TREE. 1. Introduction. Consider a tree with two types of nodes, say A and B, and the following properties:

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

PRE-CALCULUS GRADE 12

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Unsupervised Data Mining (Clustering)

Remco van der Hofstad a & Nelly Litvak b a Department of Mathematics and Computer Science, Eindhoven

Polynomial and Synthetic Division. Long Division of Polynomials. Example 1. 6x 2 7x 2 x 2) 19x 2 16x 4 6x3 12x 2 7x 2 16x 7x 2 14x. 2x 4.

How To Understand And Solve Algebraic Equations

Introduction to Engineering System Dynamics

Solving simultaneous equations using the inverse matrix

PROGRAM DIRECTOR: Arthur O Connor Contact: URL : THE PROGRAM Careers in Data Analytics Admissions Criteria CURRICULUM Program Requirements

CAS CS 565, Data Mining

Basic Data Analysis. Stephen Turnbull Business Administration and Public Policy Lecture 12: June 22, Abstract. Review session.

B669 Sublinear Algorithms for Big Data

ModelingandSimulationofthe OpenSourceSoftware Community

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Protein Protein Interaction Networks

Monitoring of Internet traffic and applications

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Structural constraints in complex networks

Grid Density Clustering Algorithm

Introduction to Regression and Data Analysis

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Mixing internal and external data for managing operational risk

not possible or was possible at a high cost for collecting the data.

Architectures for Big Data Analytics A database perspective

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

Solving Simultaneous Equations and Matrices

Random graphs with a given degree sequence

May 2015 Robert Gibbon & Jochen Stroobants

The Open University s repository of research publications and other research outputs

Graph theoretic approach to analyze amino acid network

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Algebra 1 Course Information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

BookTOC.txt. 1. Functions, Graphs, and Models. Algebra Toolbox. Sets. The Real Numbers. Inequalities and Intervals on the Real Number Line

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Social Media Mining. Data Mining Essentials

Alabama Department of Postsecondary Education

HIGH PERFORMANCE BIG DATA ANALYTICS

Mathematics (MAT) MAT 061 Basic Euclidean Geometry 3 Hours. MAT 051 Pre-Algebra 4 Hours

Scalable Source Routing

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Stationary random graphs on Z with prescribed iid degrees and finite mean connections

Gerry Hobbs, Department of Statistics, West Virginia University

An Overview of Knowledge Discovery Database and Data mining Techniques

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

What Is Probability?

Integrating a Big Data Platform into Government:

Chapter 4. Polynomial and Rational Functions. 4.1 Polynomial Functions and Their Graphs

Using multiple models: Bagging, Boosting, Ensembles, Forests

Dmitri Krioukov CAIDA/UCSD

UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS

Precalculus REVERSE CORRELATION. Content Expectations for. Precalculus. Michigan CONTENT EXPECTATIONS FOR PRECALCULUS CHAPTER/LESSON TITLES

430 Statistics and Financial Mathematics for Business

Florida Math for College Readiness

The Analysis of Dynamical Queueing Systems (Background)

Dimensional analysis is a method for reducing the number and complexity of experimental variables that affect a given physical phenomena.

Forecasting in supply chains

Access Code: RVAE4-EGKVN Financial Aid Code: 6A9DB-DEE3B-74F

Data, Measurements, Features

Transcription:

Data Science Center Eindhoven Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico Dutch Mathematical Congress April 15, 2015

Contents 1. Big Data terminology 2. Various mathematical topics 3. Conclusions

What is big data? Term coined by John Mashey, chief scientist at Silicon Graphics in the 1990 s I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing But : chemometrics has a long history of analyzing large data sets

How big is big data? Horizon 2020 EU-ICT16-2015 call: "extremely large" means "so large that today no amount of money could buy a system capable of handling it".

Four V s of Big Data

High-dimensional data: n p

Big Data buzz words scalable algorithm data science / data scientist streaming data data warehousing and ETL in-memory database predictive analytics / predictive modelling high performance computing (exascale computing)

Machine learning and data mining Machine learning focuses on prediction based on known properties learned from training data Data mining focuses on the discovery of (previously) unknown properties of the data Leo Breiman: Two cultures

MapReduce

Open source tools

Course @ Columbia University

Topic: Network structure dependencies between nodes in networks measured by degrees of direct neighbours assortativity coefficient of Newman is nothing but Pearson s correlation statistic inconsistent estimator when variances are infinite Spearman rank correlation behaves better but calculation is computationally intensive requires heavy asymptotics Van der Hofstad, R. and Litvak, N. (2014) Degree-Degree Dependencies in Random Graphs with Heavy-Tailed Degrees. Internet Mathematics, 10 (3-4). pp. 287-334

Topic : Network monitoring Measure Value 0.012 0.01 0.008 0.006 0.004 0.002 0 Al-Qaeda Network Measures Betweeness Density Closeness 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Time C+ Closeness CUSUM Statistic for Al-Qaeda (1994-2004) 8 7 6 5 4 3 2 1 Change Point Signal 0 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Challenges: monitor high-number of variables models to capture structural changes scalable algorithms for likelihood ratio calculations

Topic: Topological Data Analysis Common Big Data problem is to choose relevant features from high-dimensional data Combination of machine learning with topological tools (simplices, cohomology) yields new algorithms for clustering

Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the initial conditions of parameters 2. parametrisation of design building blocks For 1) : polynomial chaos (Wiener chaos), inverse statistical models, Bayesian analysis (calibration) For 2): Model Order Reduction

Topic: Streaming algorithms High volume high velocity data makes exact counting of frequencies or number of different items impossible but approximate answers suffice in big data. New ideas using probabilistic means (random hash functions): Count-Min Sketch MinHash

Topic: Compressed Sensing

Topic: Recommendation systems Approach (Moitra): Use connection to existence theorems for polynomials solutions to algebraic equations and develop scalable algorithms for nonnegative matrix factorizations

Statistical challenges Big Data shows phenomena not present in small data : heterogeneity spurious correlations noise accumulation incidental endogeneity This requires critically revising statistical models and developing new tools

Big Data Research Funding Big Data Challenges Digital Sciences High Performance Computing

Mathematical contributions in general Modelling Performance of algorithms Statistical thinking We need to work hard to make mathematical contributions more explicit.

Actions get involved in local data science centres get into touch with NWO get into touch with EU (e.g., Nov 6 2014 Workshop Mathematics for Digital Sciences ) 2014 IMS presidential address Bin Yu: work on real problems, relevant theory will follow

Journals

Learn from the past

Conclusions interesting mathematics behind big data statistics combinatorics numerical mathematics topology... efforts required learn big data concepts get into touch with computer scientists actions required to ensure funding data science centres funding agencies