Big Data Science. Prof.dr.ir. Geert-Jan Houben. TU Delft Web Information Systems Delft Data Science KIVI chair Big Data Science




Questions driving Big Data Science
- big data: it's there, it's important
- it is interesting to study it, to understand it, and to know how to engineer it
- scientifically, we are driven by many questions and unprecedented challenges
- Business opportunities, societal challenges, organizational & societal innovation, science game changer

Big Data can fuel our economy & society
- 100 billion in economic and societal value
- millions of new jobs and millions of new talent to educate
- in technology to get knowledge and value out of big data
- often, (massive amounts of) data from outside the system, with properties that systems are grappling with
- data, too big to handle

Smart sectors rely on Big Data
- decentralized & sustainable energy systems
- smart mobility
- personalized health
- smart industry
- digital society
- smart enterprises
- integral water management
- secure society
- intelligent living environments

Traditional role of data
- typically information technology, computing science, and a natural focus on software
- the complexity is thought to be in efficiency
- a prescriptive design approach: closed, fixed, centralized
- data representing the world is made to fit the software

Web & Data
- the Web brought linking data & connecting it to people
- for adaptation: utility
- a descriptive approach: open, dynamic, decentralized
- an unprecedented source of data about the world (that people are part of): big data, too big to handle

Complexity in data
- technology to handle big data asks for a fundamentally new computing science
- digital (Web) data and its descriptions of the world bring a new complexity
- to make sense of the data, data science is all about Semantics, with Scale, Speed & Sustainability
[Figure labels: Data Gap; Computational power; Time]

Data Science for advancing technology
- data science is a new scientific discipline for scientific understanding and creating technology for how to create, process, and understand digital data
- data science is the foundational discipline for engineering data-driven systems
[Figure labels: Data Gap; Computational power; Time]

Data reflects the world
- big data sets can be interpreted with new scientific methods
- the world can be observed in more detail, leading to new knowledge and insight, and potentially better products, services or decisions
- more detailed reflections of the world come with new questions

Unprecedented reflections
- how to conclude from heterogeneous, incomplete, unverifiable, sensitive, distributed data with unknown errors?
- how to relate big data to small data?
- how to check and repeat analyses?

Unprecedented reflections
- how to prevent false conclusions?
- how to answer questions without revealing secrets?
- how to answer questions without stopping time?
- how to handle ownership of data and satisfy personal and legal contexts?

Delft Data Science: research & education for technology & talent
- TU Delft coordinating initiative for research, education and training in data science and technology


Hardware for Data Science
- Big Data Computing Systems: application specific computing systems and hardware
- New Algorithms & Architectures: application and domain specific, e.g. finance/bio-informatics/seismic
Enabling big data computing systems to adapt to the challenges

Software for Data Science
- Problem: programming multi-core distributed cloud machines with Von Neumann programming languages
- Problem: data engineers and scientists not trained as software engineers
- Solution: programming languages that abstract from hardware, close to domain experts
- Cloud Programming: composing computations using mathematically solid foundations
- Domain-Specific Languages: enabling software engineers to systematically design & apply DSLs
- reactive extensions, interactive extensions
Enabling programmability of big data analytics

Data Management for Data Science
- Graph & Network Data Processing: processing graphs and networks at big data and web scale
  - Web-Scale Graph Indexing
  - Declarative Graph Analysis Languages
- Humans Interacting in the Process: enabling systems to include human computation & interpretation
  - Crowdsourcing and Human Computation
  - Crowd Capacity
- Optimizing Big Data Processing Workflows
  - Data Generation & Data Curation
Enabling big data sets management at scale and with human interpretation

[Figure, credited to Dave de Roure: "Open question: how to build social systems at scale?" Axes: More machines; More people. Labels: Big Data, Big Compute; Conventional Computation; Social Networking; e-infrastructure; online R&D; The Future!]

[Paper excerpt on SociaLite shown on the slide]

Fig. 8. Sample SociaLite programs.

    PageRank (Iteration i+1)
    int N = 4847571. // # of nodes in LiveJournal data
    EDGE (int src: 0..N, (int sink)).
    EDGECOUNT (int src: 0..N, int cnt).
    NODES (int n: 0..N).
    RANK (int iter: 0..10, (int node: 0..N, int rank)).
    RANK(i+1, n, $SUM(r)) :- NODES(n), r = 0.15/N;
                          :- RANK(i, p, r1), EDGE(p, n),
                             EDGECOUNT(p, cnt), cnt > 0, r = 0.85 * r1/cnt.

    Connected Components
    int N = 1768195. // # of nodes in Last.fm data
    EDGE (int src: 0..N, (int sink)).
    NODES (int n: 0..N).
    COMP (int n: 0..N, int root).
    COMPIDS (int id).
    COMPCOUNT (int cnt).
    COMP(n, $MIN(i)) :- NODES(n), i = n;
                     :- COMP(p, i), EDGE(p, n).
    COMPIDS(id) :- COMP(_, id).
    COMPCOUNT($SUM(1)) :- COMPIDS(id).

    Triangles
    int N = 1768195. // # of nodes in Last.fm data
    EDGE (int src: 0..N, (int sink)) orderby sink.
    TRIANGLE (int x, int y, int z).
    TOTAL (int cnt).
    TRIANGLE(x, y, z) :- EDGE(x, y), x < y,
                         EDGE(y, z), y < z, EDGE(x, z).
    TOTAL($SUM(1)) :- TRIANGLE(x, y, z).

Fig. 9. Number of non-commented lines of code for optimized Java programs and their equivalent SociaLite programs.

                              Hand-optimized Java   SociaLite
    Shortest Paths                    161                4
    PageRank                           92                8
    Hubs and Authorities              104               17
    Mutual Neighbors                   77                6
    Connected Components              103                9
    Triangles                          83                6
    Clustering Coefficients            84               10
    Total                             704               60

... 4 to 17 lines, with a total of 60 lines; Java programs for these algorithms with comparable performance range from 77 to 161 lines of code, with a total of 704 lines (Figure 9). SociaLite programs are an order of magnitude more succinct than Java programs and are correspondingly easier to write. (More details on these Java programs will be presented in Section VI-F.)

PageRank. Unlike most other graph algorithms that seek a fixed-point solution, PageRank is an iterative algorithm which runs until a convergence threshold is achieved. Shown in Figure 8 is the code for iteration i+1. Let r = RANK(i, n) be the rank of node n in iteration i, expressed as RANK(i, n, r) in the SociaLite program. In each step, the new PageRank is computed with the following formula:

    \mathrm{RANK}(i+1, n) = \frac{1-d}{N} + d \sum_{p \,\mid\, \mathrm{EDGE}(p,n)} \frac{\mathrm{RANK}(i, p)}{|\mathrm{EDGE}(p, n)|}

where N is the number of nodes in the graph, the denominator counts the outgoing edges of p (cnt in EDGECOUNT(p, cnt) in the program), and d is a parameter called the damping factor, which is typically set to ...

... the iteration, the fact that $SUM is not a meet operation is inconsequential.

Connected Components. The connected component ID for each node is simply the minimum of all the node IDs that the node is connected to. The algorithm is expressed simply and directly in four rules. The first rule initializes the component of each node to its own ID. The second rule recursively sets the component ID to the minimum of the component IDs it is connected to directly. Since $MIN is a meet operation, seminaive evaluation can be applied. Because of the priority queue used to keep track of the component IDs, the lowest values propagate quickly to the neighboring nodes. The third and fourth rules simply collect up all the unique component IDs and count them.

Triangles. Finding triangles can be easily described in SociaLite. We specify the condition for having a triangle in the first rule. To avoid counting the same triangle multiple times, the edges of the triangles are sorted from smallest to largest. The second rule simply counts up all the triangles.

Discussion. In imperative programming, the programmer lays out all the data structures and controls the order in which every piece of data is accessed. SociaLite programmers simply describe the computations declaratively as recursive SociaLite rules; they have some control over the layout and execution order by declaring how the data are to be indexed, ordered and nested. The choice is limited and the decisions are often obvious. For example, we see from the sample SociaLite programs that indexed arrays are used to represent properties of nodes (such as source nodes in graph edges and the iterations for PageRank). Relations expected to have common columns, such as graph edges, have nested structures. Also, for the Triangles program, due to the comparison in the rules, it is useful to sort the sink field, so that a binary search can be used to quickly determine the range of sink values that will satisfy the predicate. Note that because the sink field is represented by nested tables, sorting is applied to the column of each table, which is exactly what we need for this algorithm.
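To make the three SociaLite programs above more concrete, here is a minimal plain-Python sketch of the same computations. It is an illustration under stated assumptions, not SociaLite's evaluation strategy: there is no seminaive evaluation, priority queue, or nested-table indexing. The function names are hypothetical. The damping factor d = 0.85 and the 10 iterations are read off the constants 0.85, 0.15/N and `iter: 0..10` in the listing; the uniform starting rank is an assumption.

```python
from collections import defaultdict


def pagerank(edges, num_nodes, d=0.85, iterations=10):
    """Apply RANK(i+1, n) = (1 - d)/N + d * sum(RANK(i, p) / outdeg(p)) repeatedly."""
    out_degree = defaultdict(int)            # plays the role of EDGECOUNT
    for src, _ in edges:
        out_degree[src] += 1
    rank = [1.0 / num_nodes] * num_nodes     # assumed uniform starting rank
    for _ in range(iterations):
        new_rank = [(1.0 - d) / num_nodes] * num_nodes
        for src, sink in edges:              # every edge (src, sink) has out_degree[src] > 0
            new_rank[sink] += d * rank[src] / out_degree[src]
        rank = new_rank
    return rank


def connected_components(edges, num_nodes):
    """Propagate the minimum node ID along directed edges until nothing changes.

    For undirected connectivity the edge list must contain both directions.
    """
    comp = list(range(num_nodes))            # rule 1: each node starts with its own ID
    changed = True
    while changed:
        changed = False
        for p, n in edges:                   # rule 2: take the minimum over incoming edges
            if comp[p] < comp[n]:
                comp[n] = comp[p]
                changed = True
    return comp, len(set(comp))              # rules 3 and 4: collect and count unique IDs


def count_triangles(edges):
    """Count triples x < y < z with edges (x, y), (y, z) and (x, z)."""
    edge_set = set(edges)
    larger_neighbours = defaultdict(set)     # node -> neighbours with a larger ID
    for x, y in edge_set:
        if x < y:
            larger_neighbours[x].add(y)
    total = 0
    for x, y in edge_set:
        if x < y:
            for z in larger_neighbours[y]:   # y < z holds by construction
                if (x, z) in edge_set:
                    total += 1
    return total


if __name__ == "__main__":
    # One triangle 0-1-2 (both directions listed) plus an isolated node 3.
    demo_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
    print(pagerank(demo_edges, 4))
    print(connected_components(demo_edges, 4))
    print(count_triangles(demo_edges))
```

As in the formula as written, nodes without outgoing edges pass on no rank, so the ranks need not sum to 1. The ordering x < y < z mirrors the excerpt's remark about sorting the edges of a triangle so that each triangle is counted once.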

Visualisation for Data Science
- Big Data Visualisation: for real-time visual analytics, e.g. medicine, environment
- Big Data Interaction: for intuitive big data exploration and manipulation
Enabling big data visual analytics

Science of Social Data
- Opportunity: data generated by humans, (re)presenting their take on the world
- Challenge: largest source ever made, with yet-to-discover semantics
- Social Data Analytics Machines: repurposing social data that is out there, in a controlled & well-understood manner
  - Emergencies & incidents
  - Intelligent cities
  - Massive online education
- Data Creation & Interpretation Machines: including humans & human computing, helping software in pro-actively creating & interpreting social data
  - Social sensing
  - Workforce engagement
  - Crowd annotation & knowledge creation
Unlocking human-generated data

Example: Online Education
- Learning analytics with Delft Extension School for Open & Online Education
- analytics to make online education truly learner-centric and to adapt to the students & their backgrounds
- massive online education is about massively adapting to the context of use
- with increasing diversity comes importance of social and cultural features: inclusion

Unlocking Social Data: Software & Human-enhanced Machines with Well-Understood Properties
WIS - Web Information Systems

Science of social data
1. Social data gives us one of the largest reflections of the world, but/and it is a man-made reflection: unique opportunity turning into interesting research problem
2. Sense & value come from big data, but even more so from what (software and human-enhanced) machines can make of the data: V = M * D
3. The power of what machines can do with the data needs to be well-understood and transparent for solid engineering and uptake: what machines can do and what they cannot do
4. Science and technology follow the principles of the Web: fundamental & experimental

Delft Data Science & domains
- Data Science for Intelligent Cities
- Data Science for Environmental Monitoring
- Data Science for Finance
- Data Science for Health
- Data Science for Online Education
- Data Science for Cybersecurity
- Data Science in Open Data
- Data Science for Workforce Management

New scientific methods & collaborations
- data science comes with new methods for doing science and new ways to collaborate
- data science labs are in-situ
- data science research shows the importance of local understanding of how to apply [seemingly global] technology

Research Agenda
Over the coming years, researchers engage with engineers for inspiration, experimentation, and valorization. We invite you to collaborate.

Next?
- Today, the first three master classes.
- Get introduced to three branches of data science.
- Engage with this research after today.

Alexandru Iosup: Scalable High Performance Systems

Scientific and Societal Challenges
- How to massivize datacenters?
  - Super-scalable, super-flexible, yet efficient
  - End-to-end automation
  - Dynamic workloads
  - Evolving hardware and software
  - Strict performance, cost, energy, reliability, and fairness requirements
- The quadruple helix: prosperous society & blooming economy & inventive academia & wise governance
  - Enable data access & processing as a fundamental right in Europe
  - Enable big science and engineering (2020: 100 bn., 1 mil. jobs)
  - To out-compute is to out-compete, but with energy footprint <5%
  - Keep Internet-services affordable yet high quality in Europe
  - The Schiphol of computation: Netherlands as a world-wide ICT hub

Alessandro Bozzon: Crowdsourcing in Enterprise Environments

Challenges
- Scientific Challenge: How can humans and machines better collaborate in computation problems?
- Societal (and business) challenges
  - Knowledge Creation
  - Inclusion and Well-being
  - Employment
[Figure labels: More machines (scalability); More people; Big Data, Big Compute; Conventional Computation; Human-enhanced Data Management; Human Computation; Human Intelligence]

Zaid Al-Ars: Acceleration of Personalized Medicine Applications

Scientific and societal challenges
- Urgent clinical diagnostics, for example targeted cancer & neo-natal diagnostics
  -> We provide techniques to reduce compute time
- Cost prohibitive for society: more patients & diseases to be treated
  -> We provide techniques to reduce cost
[Figure labels: COMPUTE COST; COMPUTE TIME]

gjhouben.nl
wis.ewi.tudelft.nl
delftdatascience.tudelft.nl