Essex Big Data and Analytics Summer School 2015, 24th - 28th August 2015

Course code and title: BD001 Introduction to R, 5 days; BD002 Big Data Methods in R; BD003 Science and Big Data
Presenter: Leo Schalkwyk; Szymon Walkowiak, UK Data Service; Andrew Harrison
Level: (TBC)

BD001: R is an interactive computing environment and programming language designed for statistical analysis and graphics. Extensions to the basic capabilities of R are straightforward to produce and share with others. It is widely and increasingly used in many Big Data fields of research, including bioinformatics. Because of its power and flexibility, R is more demanding to learn than traditional statistical packages but rewards some initial effort. This course is based on tested material that we have been using for nearly 10 years to help research students, postdocs and faculty get started in their own data analysis, and is refined each time based on feedback. It is aimed at people who may have little or no programming experience. The course will emphasize the fundamentals of the R language in an intensive format where each student has a computer and 50% of the time is spent on practical exercises, and will include a special module on techniques.

BD002: This course will provide participants with an array of major techniques and essential R programming skills in the data analysis process of large and complex socio-economic datasets. In particular, the participants will be introduced to:
- The basics of Big Data extraction and technical requirements for effective Big Data manipulation
- Methods of Big Data management including sub-setting, data transformations, screening for missing values etc.
- R packages supporting Big Data manipulation techniques, e.g. extracting and converting between date and time formats, text mining etc.
- Descriptive statistics and frequency tables for Big Data
- Libraries facilitating Big Data statistical computation and modelling
- Interactive Big Data visualisation techniques
- The process of Big Data product development
The course will involve active learning methods with case studies and real socio-economic data. Prerequisite(s): A working proficiency in R or attendance at the Summer School's Introduction to R course (BD001).

BD003: The history of science has shown on many occasions the benefits of bringing data sets together. It has also shown that deep insights into the Universe lead to theories that provide elegant explanations behind great unifications of knowledge (and hence data). These theories can be, in many cases, described by mathematical concepts, giving clues to how we should best represent the data in order to aid understanding.

I will describe some of the best understood theories and their representations. I will break scientific studies into those of simple, complex and complicated systems. New sources of simple and complex data may offer our best chance of providing new unifications and understanding of the causal structures within nature, whereas complicated sources may offer little hope of inferring causality.

Course code and title: BD004 Clustering and Classification with [...] in R; BD005 Bayesian Computational Methods with applications (in R); BD006 Actuarial/Financial modelling with applications in R; BD007 Introduction to Data Mining
Presenter: Berthold Lausen; Hongsheng Dai and Saeed Aldahmani; Spyros Vrontos; Beatriz de la Iglesia, University of East Anglia, ESRC Business and Local Government Data Research Centre
Level: Advanced/intermediate; 8 hours (over 2 days)

BD004: The short course gives an introduction to cluster analysis (unsupervised learning) and classification (supervised learning). The concepts of k-means clustering and hierarchical clustering are discussed and applied in R. Linear discriminant analysis, logistic regression, classification and regression trees (CART) and random forests are introduced as examples of statistical learning methods. Cross-validation and bootstrap methods are applied to assess classifiers. Using R, participants analyse example data sets and compute estimates of the misclassification rate and of the area under the receiver operating characteristic (ROC) curve. Prerequisite(s): Basic skills using R, and basic concepts in statistics such as correlation and linear regression.

BD005: The course will first provide a brief introduction to Bayesian analysis and then cover Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm and the Gibbs sampler. Mixture models, change-point problems and regression analysis will also be covered in the lecture. The course includes a 2-hour lab session to help the audience become familiar with implementing MCMC algorithms using R. Prerequisite(s): Participants should have knowledge of at least first-year statistics, probability and R.

BD006: Modelling claim frequency and claim severity in general insurance, distribution fitting, application of generalised linear models in pricing, ratemaking and bonus-malus systems. Modelling the returns of financial assets. Option pricing in finance and insurance. Monte Carlo methods and their application in option pricing and in pricing life insurance liabilities. Extensive applications in R with real and simulated data sets. Prerequisite(s): Basic knowledge of statistics and R.

BD007: The course will introduce the topic of data mining, and will present a methodology for Knowledge Discovery in Databases (KDD). The tasks of clustering and classification will be explored in some detail. We will look at an open source data mining package for some practical guidance on how to put what has been learned into practice.

Course code and title: BD008 A (gentle) introduction to reinforcement learning; BD009 Search in big data; BD010 Practical sentiment analysis; BD011 High performance computing; BD012 Data Protection and Liability in the Age of Big Data Analytics
Category: Legal and ethical issues
Presenter: Spyros Samothrakis; Allan Hanbury, Vienna University of Technology; Diana Maynard, University of Sheffield; Adrian Clark; Audrey Guinchard
Level: 6 or 8 hours; /advanced; 8 hour

BD008: Reinforcement learning is concerned with learning how to act optimally in the presence of rewards and punishments. This short course on reinforcement learning will help you understand the basics and provide a solid foundation necessary for advanced topics. It will have both a practical (two hour) and a theoretical (two hour) component. Topics to be addressed are Markov Decision Processes, Monte Carlo methods, SARSA and Q-Learning. Prerequisite(s): Some mathematical/computer science sophistication (e.g. understanding summation, recursion, means/medians).

BD009: As the amount of text data stored by organisations grows, information retrieval technologies become increasingly important. Effective use of search technologies is essential to ensuring that the key information is available when decisions are made. This course will start by covering the basics of information retrieval, such as indexing and keyword search. It will then cover adapting search to specific domains (such as the technical and health domains), and will finally present how the effectiveness of search technologies is evaluated. Prerequisite(s): Participants need to be comfortable with basic mathematics, especially linear algebra.

BD010: This tutorial will introduce the concept of sentiment analysis from unstructured text. It will cover both rule-based and machine learning techniques, provide some background information on the key underlying NLP and text analysis processes required, and look in detail at some of the major problems and solutions, such as detection of sarcasm, use of informal language, spam opinion detection, trustworthiness of opinion holders, and so on. The techniques will be demonstrated with real applications developed in GATE, an open-source language processing toolkit. Hands-on exercises and relevant materials will be provided for participants to try out the applications, and to experiment with building their own tools, both in GATE and with other common tools. Prerequisite(s): No prior knowledge of GATE, Java or Natural Language Processing (NLP) is required to attend this tutorial. However, it will include a hands-on element where you will be able to try simple things out in GATE, the tool we use for NLP tasks.

BD011: This course introduces participants to high performance computing. The first half of the course will cover principles (floating-point computation, speeding up code, compute clusters, using MPI) while, in the second part, participants will have the opportunity to build and use a small cluster. Prerequisite(s): The course assumes knowledge of programming in Python/C/C++.

BD012: This session aims to introduce the current EU and UK data protection regime and the changes to be brought in by the future General Data Protection Regulation in late 2015. Furthermore, the session will present and allow for discussion of the specific challenges big data bring, especially in light of the reports published by various data protection regulators on [...], both at UK and EU levels.

Course code and title: BD013 Managing, curating and publishing data; BD014 Secure access protocols for Big Data; BD015 Agent based modelling for business
Category: Curation and management of data; Curation and management of data
Presenter: Sharon Bolton and Louise Corti, UK Data Service; Libby Bishop and Felix Ritchie, UK Data Service; Abhijit Sengupta
Level: Advanced

BD013: Big data may come from a range of sources and organisations, which may not be used to the idea of sharing their data with researchers. Therefore, they might not realise what researchers need, and so some of the features traditionally present that make research data easier to use might not be available. This can bring a range of problems, some of which can be addressed by good data curation. The course will start with the legal issues in brokering data and the assessment of issues of trust in the quality of the source: who is the provider? It will also highlight ethical issues around the content and use of personal data. For example, some of the questions we plan to address in the session include:
- Data confidentiality: are people identifiable from the data?
- Metadata and accompanying documentation: do users have enough information about what the data means and how it can be used?
- Formats, size and usability: what kind of software, hardware and techniques are needed?
- Publishing data products or data to support a journal article: what does the supporting data look like for verification?
- Run a hands-on exercise publishing a small dataset in a repository and providing the necessary metadata and documentation, to learn what curation is and what is needed.

BD014: This course covers aspects of accessing and using confidential sources of Big Data, including:
- The Five Safes of data access
- Big Data confidentiality/privacy/ethical considerations: what you need to know
- How to be a Safe Person when using confidential sources of Big Data
- Using Big Data responsibly
- Designing a Safe Setting for Big Data
- Disclosure control techniques: to data, to your research outputs
The objective is to prepare people who want to access confidential sources of Big Data. They might be making an application to a data owner, or for funding which has to go through an ethics panel. Or they might be using Big Data but be unaware of some of the confidential/privacy/public-perception issues that surround the collection and analysis of Big Data.

BD015: This course will start by providing students with an overview of the nature of business applications where Agent Based Modelling (ABM) can be useful, relevant and practical. It will then proceed with some real world examples where ABM has been used, particularly in the context of the Fast Moving Consumer Goods (FMCG) sector. A few of these examples, which are in the public domain and have had an academic influence, will be discussed in detail.

The lecture will conclude with some indicators of where the future of this modelling paradigm lies in the context of business applications. Prerequisite(s): Understanding of complex systems phenomena. Familiarity with social networks and properties of networks. Reasonable knowledge of at least one ABM toolkit such as Repast or NetLogo. All practical examples in this course will be NetLogo based.

Course code and title: BD016 Machine Learning with Mahout (tbc); BD017 Big Data Finance Analytics; BD018 Cognitive Computing
Presenter: Richard Skeggs, ESRC Business and Local Government Data Research Centre; Neil Kellard; Detlef Nauck and Martin Spott, British Telecom
Level: /advanced

BD016: This is an introduction to the use of machine learning algorithms supported by the Apache Mahout framework. The class will concentrate on what problems can be solved using Mahout before looking at the common classifiers used by Mahout to achieve those objectives. Finally, the class will look at building some simple working examples to see Mahout in practice. Prerequisite(s): Knowledge of the Java programming language is essential. Some statistical knowledge will be useful but not essential.

BD017: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Given contemporary computing power and potential data collection, many firms, particularly those from the financial sector, wish to use [...]. The challenges include capture, curation, storage, search, sharing, transfer and data visualization. The primary purpose of this course is to provide the participant with an understanding of data analytic approaches in finance. The first part covers high frequency trading and predictive [...]. The second part will concentrate on the application of data in risk modelling, corporate finance, fraud and personal finance. Prerequisite(s): Some background in statistics/mathematics/econometrics is desirable but not essential.

BD018: While we are successfully addressing the challenges behind storing and managing massive amounts of data through technologies, we are still facing large obstacles in successfully and quickly analysing that data. The view that one analyst uses tools from statistics, machine learning and data mining to find answers in data rapidly becomes outdated in the face of an overwhelming amount and variety of data and an ever increasing demand for evidence based decision making. We now need to look into concepts of collaborative and distributed [...] where analysts work together and combine individual results to an overall answer. We need tools that can deal with uncertainty and can assess the quality of potential answers. We need new human-computer interfaces that allow computers to really help analysts find answers that they could not have come up with themselves. We also need computers to help analysts to illustrate and explain the outcome of [...] to decision makers so they have confidence in the results. Cognitive Computing addresses several of these issues. Cognitive Computing looks at how we get computers to behave and interact the way humans do. Systems like IBM's Watson can deal with huge volumes of data, identify knowledge patterns in the data and apply this to the problem the analyst is trying to solve by giving them different alternatives to consider and, in particular, the underlying evidence that supports those alternatives. This course looks at the challenges modern [...] is facing and explores how ideas from Cognitive Computing can lead to a new era of data. Prerequisite(s): A basic understanding of what is involved in running a data science project.

Course code and title: BD019 Stream Processing and Data Analytics for Smart City; BD020 Crowdsourcing and Human Computation; BD021 From Big Data to Big Value; BD022 Introduction to Big Data Statistics, TBC
Presenter: Sefki Kolozali and Nazli Farajidavar, University of Surrey; Jon Chamberlain; Richard Mason, Intel; Nathan Cunningham, UK Data Service
Level: /advanced

BD019: In this course we cover some of the background concepts related to the Internet of Things and Web of Things, and Semantic Technologies in the smart city domain, and will describe solutions for processing and information extraction from real world data. Use-cases and examples from the smart city domain will be described. We will also discuss some of the machine learning techniques and data tools and methods that can be used to process and analyse the smart city data. Prerequisite(s): Familiarity with machine learning techniques and semantic web technologies would be useful but is not compulsory.

BD020: Crowdsourcing has established itself in the mainstream of research methodology in recent years, using a variety of methods to engage many non-expert users to solve problems that computers or limited expert users cannot solve. Whilst the concept of human computation goes some way towards solving problems, it also introduces new challenges of data quality, participant recruitment and incentivisation. This course will introduce 3 common methods of crowdsourcing: peer-production, microworking and games-with-a-purpose, as well as an emerging approach using social networks as a powerful problem solving and monitoring tool. Participants are encouraged to bring examples of data they would like annotated or tasks that need humans to solve, for discussion as to which approach might be suitable and how to implement it.

BD021: Learn how Intel is harnessing Big Data to drive operational efficiency and revenue optimisation across the organisation. Discuss trends and how Intel is embracing these trends to gain further insights, adoption and value.

BD022: This is a short introductory course on understanding Big Data, what it is and what strategies you can adopt to make the most out of it. It would be useful to bring a device for note taking. This course will cover:
- Putting new knowledge first.
- What question do you want to answer?
- Defining metrics for success.
- What is Big Data?
- What Big Data solutions are available to me for free?
- Do you know what your real sample size is?

- Testing hypotheses and calling things significant.
- Managing spurious correlations.
- Smoothing data to understand significant relationships in spatial/temporal data.
- Make [...] as small as possible and as quick as possible.
- Plotting your data so you don't miss the obvious.
- Strategies for improving prediction accuracy by averaging many models together.
Prerequisite(s): Familiarity with using and applying science/research data to answer questions. An understanding of statistics and how databases operate is desirable. A basic overview of computing infrastructure and algorithms will be discussed, but at an introductory level assuming no prior knowledge.

Keynote Lectures

Company: Thomson Reuters
Presenter: Jochen Leidner
Title of talk: Small Data and Big Data: Qualitative Differences Resulting from Quantitative Scale
Abstract: While the Big Data topic has received a lot of attention, one may wonder why exactly "more of the same" should constitute a step change; for instance, we haven't declared a new academic field of "Big Plastic" just because we consume and process more plastic than ever. In this talk, I critically assess which, if any, quantitative changes induce qualitative changes, and whether the talk of Big Data as a new area is merited. Along the way, we will revisit a couple of past and ongoing efforts that fall into the space and apply these findings.

Company: Intel
Presenter: Mark Woodward
Title of talk: Using Big Data to Generate Real Revenue for Business
Abstract: This talk will describe how Intel takes advantage of large, complex data sources to achieve greater efficiency, cost saving and new revenue opportunities across its business. As part of the talk, examples of initiatives and real world business scenarios in the Technology and Manufacturing world will be discussed.

Company: Fujitsu
Presenter: Joe Duran
Title of talk: Impact of Research on Computing in Society
Abstract: TBC

Company: Citigroup
Presenter: Stuart Jones
Title of talk: Bridging the Gap Between Big Data, Statistics and Business
Abstract: This talk will describe how understanding business objectives, including revenue, expense and risk management, can be satisfied with statistical analysis of [...]. As part of the talk, examples of initiatives that will assist in detecting and preventing fraudulent money-laundering activity in the financial world will be discussed.