ICT Perspectives on Big Data: Well Sorted Materials
3 March 2015

Contents
Introduction
Dendrogram
Tree Map
Heat Map
Raw Group Data

For an online, interactive version of the visualisations in this document, go here: www.well-sorted.org/output/ictperspectivesonbigdata
Introduction

Dear participant,

Thank you for taking part in submitting and sorting your ideas. This document contains several visualisations of your ideas, grouped by the average of your online sorts. They are:

Dendrogram - This tree shows each submitted idea and its similarity to the others. The lower two ideas 'join', the more people grouped those two ideas together. For example, if two ideas join at the very bottom, every person grouped those two together.

Tree Map - This visualisation presents an 'average' grouping. It is calculated by 'cutting' the Dendrogram at the dashed line, so that any items which join lower than that line are placed in the same group. In addition, rectangles which share a side of the same length are more similar to each other than to their peers.

Heat Map - This visualisation shows a similarity matrix in which each idea is given a colour at its intersection with every other idea, indicating how similar the two are. This is useful for seeing how well formed a group is: the more red there is within a group (marked by the black lines), the more similar the ideas inside it were judged to be.

Raw Group Data - This table shows every submitted idea and its longer description. The ideas appear in the same order as in the Dendrogram (so similar ideas are close to each other) and are split into the coloured groups used in the Tree Map. In addition, each idea has been given a unique number so that it is easier to find.
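For readers curious how these visualisations can be derived from the individual sorts, here is a minimal sketch of one common approach: count how often each pair of ideas was placed in the same group (the similarity shown in the Heat Map), convert those counts into distances, build the tree with agglomerative hierarchical clustering (the Dendrogram), and cut the tree at a threshold to obtain the average grouping (the Tree Map). This is not the Well Sorted implementation; the participant sorts, the linkage method and the cut height below are illustrative assumptions.

# Illustrative sketch only (not the Well Sorted implementation): one common way
# to derive the visualisations from participants' sorts.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical input: each participant's sort is a list of groups, where each
# group is a set of idea numbers (ideas are numbered 1..n).
sorts = [
    [{1, 2}, {3, 4, 5}],    # participant A
    [{1, 2, 3}, {4, 5}],    # participant B
    [{1, 2}, {3}, {4, 5}],  # participant C
]
n = 5  # number of ideas

# Similarity matrix (the basis of the Heat Map): how often each pair of ideas
# was placed in the same group.
co = np.zeros((n, n))
for sort in sorts:
    for group in sort:
        for i in group:
            for j in group:
                if i != j:
                    co[i - 1, j - 1] += 1
similarity = co / len(sorts)

# Turn similarity into distance: pairs grouped together by more people are
# 'closer' and therefore join lower in the Dendrogram.
dist = 1.0 - similarity
np.fill_diagonal(dist, 0.0)

# Agglomerative hierarchical clustering produces the Dendrogram.
Z = linkage(squareform(dist), method="average")

# 'Cutting' the Dendrogram at a chosen height yields the average grouping used
# for the Tree Map. The 0.5 cut height here is purely illustrative.
groups = fcluster(Z, t=0.5, criterion="distance")
print(groups)  # one flat group label per idea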
Dendrogram

[Dendrogram figure; available in the online interactive version.]
Tree Map

[Tree Map figure; available in the online interactive version.]
Heat Map

[Heat Map figure; available in the online interactive version.]
Raw Group Data

Red

1. Data privacy issue
Big data offers a great opportunity to tackle societal challenges, but data privacy is one of the barriers to maximising that opportunity. One key research question: novel data processing approaches that enable efficient analysis without violating privacy.

2. Security
The scale and impact of a big data breach is likely to dwarf the scale and impact of data breaches in more traditional systems, and the distributed nature of big data systems makes them more vulnerable. Can existing approaches to data security scale to big data?

Blue

3. How can Big Data improve manufacturing processes?
Big Data will become more prominent as more and more IoT devices are introduced into manufacturing. Advanced analytics helps decode complex manufacturing processes to improve yield through data prioritisation, root cause analysis and process modelling.

4. Building Performance Determinants
Research shows higher than expected consumption of energy and resources to create and maintain healthy and comfortable indoor environments. Disparate data sources need to be linked and analysed to identify the key determinants of excessive resource consumption.
Green

5. Managing Volume and Velocity of data
Complex sensor networks and the like will output actionable data at volumes and velocities that we have not worked with yet. There is huge potential in machine learning and automated control systems taking informed next-best actions.

6. Big Data is growing faster than Moore's Law!
The volume and velocity of data generation are increasing faster than Moore's Law (and other laws for scaling the performance of computing systems). How do we engineer compute platforms and software infrastructure to analyse big data workloads?

7. Scalable and Expressive Platforms for Big Data
We need new types of distributed software platforms for Big Data processing that are scalable, expressive and performant, but remain intuitive to use. They need to offer a real-time unified view of large sets of streaming and historic data.

8. Capacity Planning for converging digital Networks
- dealing with data volumes from multi-sensors
- capacity planning from the network
- cloud computing
- sufficient statistics
- energy-aware data centres and proxy-clouds

9. Research Data Management (RDM) at scale
The costs of storing, curating, sharing and publishing research data at scale demand new techniques for sharing data without copying it and for applying analytics without direct access to the data. New levels of collaboration and resource sharing are needed.

10. Reliable software
We need software to analyse large quantities of data, but its functionality is quite different from that of 'normal' software. This creates a challenge: how can we have confidence in the software used?

Orange

11. Data Wrangling
Data Wrangling is the process of collecting and preparing data before analysis can take place. It has emerged that data scientists spend 50-80% of their time on data wrangling (DW). Current approaches to DW are rather ad hoc; a methodology is sorely needed.

12. Determining what data are valuable
Data size and rate of creation are predicted to rise rapidly. Much of the data will be of low quality, have limited provenance and be subject to errors. Determining rapidly and efficiently which subsets of the data have value, and what to keep, will be essential.

13. Integration of heterogeneous data
Data now takes many complex forms, from time series, images and video to massive-scale sequences of amino acids, clinical images and so forth. How these are to be integrated in a coherent manner that improves the signal-to-noise ratio is a major open issue.

14. Merging Heterogeneous Big Data Sources
The key research challenge in the area of big data, in my opinion, is the problem of merging heterogeneous sources of information while quantifying, at the same time, the quality of the different data sources and the uncertainty associated with them.

15. Heterogeneous data - mixed data types
Objects are described by multiple data types; for example, a patient may be described by a mixture of structured data, time series representing the results of a particular test over time, images, text reports and more. Analysis of such complex data is under-researched.

16. Merging Disparate Data Sets (Data Fusion)
Developing computational tools to merge and contrast, at large scale, related but distinct data sources. Can we find patterns that are (a) shared or (b) distinct? Can we quantify which data sources provide the most insight and which are contradictory?

17. Big Data Integration
Value is increasingly going to be found in latent connections amongst datasets that originate from independent sources. Diversity is both an opportunity and a challenge for predictive analytics techniques that aim to distil knowledge from data.

18. Multi-source information fusion for big data
Application of AI techniques and development of data fusion methods for the extraction of salient data for decision support in cluttered, congested and conflicted big data environments.

19. Tools for addressing heterogeneity of big data
Scale is not our biggest challenge - it is heterogeneity in form (structured/unstructured data, numbers/text) and format (numerous haphazardly organised sources). Tools are urgently needed that are generalised, yet powerful enough to be useful analytically.

20. Reference Data
It is challenging to create a single database with all the data from different intrusion detection systems. Data in different formats needs to be combined so that the security analyst can make an informed decision.

21. Modelling ill-structured data
Ill-structured data include unstructured crowdsourced data (e.g. social media) and irregular samples (e.g. cross-sectional surveys) at different spatial and/or temporal densities/frequencies. Most existing analytical tools are suitable for well-structured data.

22. Flexible, scalable and semantic-based approaches
Techniques will be needed for analysing, modelling and reasoning over heterogeneous and dynamic datasets from multiple sources whose structure may change over time. This calls for highly flexible, scalable and semantic-based approaches.
Purple

23. Algorithmic Breakthroughs in Big Data Analytics
While MapReduce and the like simplify parallelism, most big data computations are just sums and counts. New directions promising breakthroughs: parallel optimisation and approximation algorithms; streaming and summary techniques (an illustrative sketch of one such technique follows this group); efficient graph/matrix algorithms.

24. New algorithmic frameworks
Shifting attention in algorithm design and analysis towards (1) relational rather than numerical data, (2) data streams rather than sets, (3) approximate solutions to imperfectly posed problems, and (4) randomised methods for partial data.

25. Robust and Theory-Driven Statistical Modelling
Big Data Analytics can produce statistics and features of interest when studying socio-technical phenomena, but what is the theory behind the analytics? Can we incorporate theory from different disciplines, and are we sure the methods used are appropriate?

26. New methods for exploratory data analysis
We need to understand the properties of microscopic events in large information spaces where information meets coincidentally rather than being causally determined. This becomes important for exploratory data analysis to generate hypotheses in big data scenarios.

27. How can we accurately quantify uncertainty?
There are established methods for quantifying uncertainty, but these will not be appropriate for big data. For example, some methods involve modelling assumptions; for big data, model errors will lead to substantial underestimation of uncertainty.

28. Combining Big Data with Prior Knowledge
Often in machine learning we want to leverage prior knowledge about the data to improve the analysis. Current approaches include feature design and probabilistic modelling, but it can be hard to express vague knowledge in the language of these formalisms.

29. Impedance Mismatch: Advanced Analyses vs Big Data Systems
Mathematical/statistical principles versus algorithms and engineering: we need to know how both sides work and how one is mapped onto the other. Guaranteed effects in analyses may be lost in the underlying infrastructure. Scalable big data analytics depends on the above.
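As a concrete illustration of the 'streaming and summary techniques' mentioned in idea 23, the following minimal sketch shows one classic example, the Misra-Gries frequent-items summary, which approximates counts over a data stream using a fixed number of counters. It is purely illustrative and not drawn from any participant's submission; the example event stream is made up.

# Purely illustrative, not from any submission: a classic streaming summary
# technique (Misra-Gries frequent items), which approximates item counts in a
# single pass over a data stream using bounded memory.
def misra_gries(stream, k):
    """Return approximate counts of frequent items using at most k-1 counters.

    Any item occurring more than len(stream)/k times is guaranteed to appear
    in the result; reported counts undercount by at most len(stream)/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Example: find candidate heavy hitters in a stream of events.
events = ["a", "b", "a", "c", "a", "a", "b", "a", "d", "a"]
print(misra_gries(events, k=3))  # 'a' occurs more than 10/3 times, so it survives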
Yellow

30. Human-Centred Analytics
The black-box nature of mining algorithms, the need for parameter tuning, and the difficulty of coping with outliers and bad data mean that experts are required to mine data. We need a human-centred approach using effective visualisation and visual analytics.

31. Big data HCI
How can we visualise, summarise, or otherwise expose what is contained in big datasets (or what we learn from analysing them), such that people can understand and exploit this knowledge?

32. User-oriented visualisation and configuration
How do we develop methodologies and technology that allow users to explore and configure large, complex data spaces and their visualisations? That is, can we take the data analytics expert out of the loop?

33. Interacting with data
Humans are often left out when talking about making sense of data. Users will need to interact with complex data in cognitively demanding situations. Understanding the needs of users in these situations is key to fully exploiting the new data-rich landscape.

34. Remove barriers in the use and understanding of data
Massive volumes of data that could be relevant to decision-makers are generated every second from corporate and public sources. How can we build interfaces to support human decision-making?

35. The Emotions of Data
What is humans' emotional connection with data, and how does it affect data-based decision-making processes? Data-based decision making implicitly assumes rational actors, but humans are not rational. How can analytics support emotional decision making?

36. Delivery of Big Data Analytics Impacts
The challenges for big data analytics come not only from the purely analytic parts, say, improving learning algorithms. The key issue is how ordinary people can benefit from those analytic results, given the huge knowledge gap in between.

37. Targeted decision support
The key challenge is to understand what could be extracted from big data that would allow us to take more informed decisions for particular uses. This would involve collaboration between data scientists and experts in the field from which the data originates.

38. Scale of the results of big data analysis
We not only need methods that scale to cope with big data; we also need to consider how to scale down the results, to make sure that we (humans) are able to understand and use their full potential.

39. Visual Big Data: Images and Video
The evolution of the internet from a text to a visual medium is well recognised and especially evident in social media (Cisco forecast: by 2018, 70% of internet traffic will be video). How do we make sense of this deluge of visual data, often with little or no metadata?