Bayesian networks - Time-series models - Apache Spark & Scala




Dr John Sandiford, CTO Bayes Server
Data Science London Meetup - November 2014

Contents
- Introduction
- Bayesian networks
- Latent variables
- Anomaly detection
- Bayesian networks & time series
- Distributed Bayesian networks
- Apache Spark & Scala

Introduction

Profile
- linkedin.com/in/johnsandiford
- PhD, Imperial College: Bayesian networks
- Machine learning: 15 years
  - Implementation
  - Application
  - Numerous techniques
- Algorithm programming: even longer
  - Scala, C#, Java, C++
- Graduate scheme: mathematician (BAE Systems)
- Artificial Intelligence / ML research program: 8 years (GE/USAF)
- BP trading & risk analytics: big data + machine learning
- Also: NYSE stock exchange, hedge fund, actuarial consultancy, national newspaper

Bayesian networks

Insight, prediction & diagnostics
Insight
- Automated
- Large patterns
- Anomalous patterns
- Multivariate
Prediction / inference
- Supervised or unsupervised
- Anomaly detection
- Time series
Diagnostics / reasoning
- Troubleshooting
- Value of information
- Decision support

Bayesian networks
- Efficient application of probability
- DAG (directed acyclic graph)
- Subset of wider probabilistic graphical models
- Not a black box
- Handle missing data
- Probabilistic predictions
- Both supervised & unsupervised techniques
- Superset of many well known models
  - Mixture model (cluster model)
  - Naïve Bayes
  - AR
  - Vector AR
  - Hidden Markov model
  - Kalman filter
  - Markov chains
  - Sequence clustering

Example: Asia network
- U = universe of variables
- X = variables being predicted
- e = evidence on any variables
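
The three symbols on this slide are the ingredients of a standard conditional query. As a hedged reconstruction (the slide's own formula did not survive the transcript, and the symbol E for the set of variables carrying the evidence e is introduced here only for illustration), a prediction can be written as:

```latex
P(X \mid e) \;=\; \frac{P(X, e)}{P(e)}
          \;=\; \frac{\sum_{U \setminus (X \cup E)} P(U)}{\sum_{U \setminus E} P(U)}
```

Both sums hold the evidence variables fixed at their observed values e. For continuous variables the sums become integrals.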

Example: Waste network

Example: the bat (40,000 links)

Example: static & temporal

Prediction & uncertainty
- Inputs to a prediction can be missing (null)
- Discrete predictions have an associated probability, e.g. {0.2, 0.8}
- Continuous predictions have both a mean and variance, e.g. mean = 0.2, variance = 2.3
- We can calculate joint probabilities over discrete, continuous or hybrid
- We can calculate the likelihood / log-likelihood

Prediction (inference)
- Basically just probability, but with complex algorithms to perform the calculations efficiently
  - Marginalization: sum (discrete), integrate (continuous); "summing in margins"
  - Multiplication
  - Bayes Theorem
- Exact inference
  - Exact subject to numerical rounding
  - Usually explicitly or implicitly operating on trees
- Approximate
  - Deterministic
  - Non-deterministic
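
The three operations named on this slide can be seen on the smallest possible network. The following is a minimal sketch only: the two-node network (Rain -> WetGrass) and all its numbers are made up for illustration, and brute-force enumeration stands in for the efficient tree-based algorithms the slide refers to.

```python
# Hypothetical two-node network: Rain -> WetGrass (all numbers are invented).
p_rain = {True: 0.2, False: 0.8}
# p_wet_given_rain[rain][wet] = P(WetGrass = wet | Rain = rain)
p_wet_given_rain = {
    True: {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}


def joint(rain, wet):
    """Multiplication: P(Rain, Wet) = P(Rain) * P(Wet | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]


def marginal_wet(wet):
    """Marginalization: sum the joint over the unobserved variable Rain."""
    return sum(joint(rain, wet) for rain in (True, False))


def posterior_rain(wet):
    """Bayes' theorem: P(Rain | Wet) = P(Rain, Wet) / P(Wet)."""
    evidence = marginal_wet(wet)
    return {rain: joint(rain, wet) / evidence for rain in (True, False)}
```

With no evidence the prediction for Rain is just the prior {0.2, 0.8}; calling `posterior_rain(True)` updates it once wet grass is observed. Enumeration like this grows exponentially with the number of variables, which is why practical engines work on tree structures instead.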

Latent variables

Latent variables
Sample data (columns X, Y; ? = missing value): 2.0 7.9 6.9 1.98 0.1 2.1 1.1 ? 9.1 7.2 ? 9.2

Latent variables

Parameter learning
- EM algorithm & extensions for missing data
- D3 animated visualization available on our website
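
To make the EM idea concrete, here is a minimal sketch for a one-dimensional mixture of two Gaussians, where the cluster label plays the role of the latent variable. It is an illustration under simplifying assumptions, not the Bayes Server implementation: the function names, the initialisation, and the variance floor are choices made here, and it does not cover the missing-data extensions mentioned on the slide.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(xs, iterations=50, min_var=1e-3):
    """EM for a 1-D mixture of two Gaussians; the cluster is the latent variable."""
    means = [min(xs), max(xs)]  # simple deterministic initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for every point.
        resp = []
        for x in xs:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the expected sufficient statistics.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, min_var)  # floor keeps a component from collapsing
    return weights, means, variances
```

For example, `em_two_gaussians([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])` should settle on one component near 1 and one near 5 with roughly equal weights. A production version would work in log space so that points far from every component do not underflow.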

Latent variables
- This is exactly the same as a mixture model (cluster model)
- This model only has X & Y, but most models have much higher dimensionality
- We can extend other models in the same way, e.g.
  - Mixture of Naïve Bayes (no longer Naïve)
  - Mixture of time series models
- A structured approach to ensemble methods?

Latent variables
- Algorithmically capture underlying mechanisms that haven't or can't be observed
- Latent variables can be both discrete & continuous
- Can be hierarchical (similar to Deep Belief networks)

Anomaly detection

Univariate Gaussian pdf

Anomaly detection: log-likelihood
- This can also be calculated for
  - Discrete, continuous & hybrid networks
  - Networks with latent variables
  - Time series networks
- Allows us to perform anomaly detection
- Under the hood, great care has to be taken to avoid underflow
  - Especially with temporal networks
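
The underflow problem is easy to reproduce: multiplying many small densities, or evaluating one density far out in a tail, quickly rounds to zero in floating point. A common remedy, sketched below under the assumption of a simple Gaussian mixture (the function names are illustrative, not a Bayes Server API), is to stay in log space and use the log-sum-exp trick whenever a latent variable has to be summed out.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log of the univariate Gaussian density, computed without exponentiating."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Stable log(sum(exp(v))): subtract the maximum before exponentiating."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mixture_log_likelihood(x, weights, means, variances):
    """log p(x) for a Gaussian mixture, summing out the latent cluster in log space."""
    terms = [math.log(w) + log_normal_pdf(x, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    return log_sum_exp(terms)


def series_log_likelihood(xs, weights, means, variances):
    """Independent observations: add log-likelihoods instead of multiplying densities."""
    return sum(mixture_log_likelihood(x, weights, means, variances) for x in xs)
```

A low log-likelihood relative to typical data marks a point as anomalous. Here a value such as 100.0 scored against components centred at 1 and 5 has a density that underflows to exactly 0.0 if exponentiated, while its log-likelihood remains a finite, comparable number.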

Anomaly detection
Log-likelihood values shown in the figure: -4.97, -63.9, -9.62, -13.7

Time series anomaly detection
- D3 animated visualization available on our website

Bayesian networks: Time series (DBN)

Sample time series data
- Multiple time series instances
- Multivariate (X1, X2)
- Different lengths

Time series, unrolled
- Distributions are shared

Time series, unrolled: lag 1

Time series, unrolled: lag 4
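
One way to picture unrolling from the data side is as a sliding window: every time step contributes a row holding its own value and the values it depends on. The helper below is a hypothetical illustration (the name `unroll` is not from the talk). Because the distributions are shared across time slices, every row it produces would inform the same transition distribution.

```python
def unroll(series, lag):
    """Return one row per time step t >= lag: (x[t-lag], ..., x[t-1], x[t])."""
    return [tuple(series[t - lag:t + 1]) for t in range(lag, len(series))]
```

With lag 1, `unroll([1, 2, 3, 4], 1)` yields the pairs (1, 2), (2, 3), (3, 4); with lag 4 each row spans five consecutive values, and a series shorter than the lag contributes no rows.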

Dynamic Bayesian network (DBN)

Equivalent model
- Structural learning algorithms can often automatically determine the links
- Cross, auto & partial correlations

Time series
- We can mix static & temporal variables in the same Bayesian network
- We can include discrete and/or continuous temporal variables

Time series & latent variables
- We can include static or temporal latent variables
  - Discrete or continuous
- In the same way that we used 3 multivariate Gaussians earlier, we can model mixtures of multivariate time series
  - i.e. model different multivariate time series behaviour
  - E.g. 2 time series may be correlated in a certain range, and anti-correlated in another

Types of time series prediction (t = time)
- P(X1@t=4): returns probabilities for discrete, mean & variance for continuous
- P(X1@t=4, X2@t=4): joint time series prediction (funnel)
- P(X1@t=2, X1@t=3): across different times
- P(A, X1@t=2): mixed static & temporal
- Log-likelihood of a multivariate time series: anomaly detection
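
For a continuous variable, a query such as P(X1@t=4) returns a mean and a variance. The sketch below shows where those two numbers come from in the simplest case, a Gaussian AR(1) model, which is used here purely as a stand-in for a dynamic Bayesian network. The function name and parameters (`c`, `phi`, `noise_var`) are assumptions for illustration.

```python
def ar1_forecast(last_value, steps, c, phi, noise_var):
    """Mean and variance of x[T + steps] given x[T] = last_value, for the model
    x[t] = c + phi * x[t-1] + e[t], with e[t] ~ N(0, noise_var)."""
    mean, var = last_value, 0.0
    for _ in range(steps):
        mean = c + phi * mean              # propagate the mean one step
        var = phi * phi * var + noise_var  # uncertainty accumulates each step
    return mean, var
```

For instance, `ar1_forecast(2.0, 2, 0.0, 0.5, 1.0)` gives a mean of 0.5 and a variance of 1.25: the further ahead the prediction, the wider the band, which is the widening shape the "funnel" label suggests.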

Distributed Bayesian networks

Different types of scalability
- Data size: big data?
- Connectivity (discrete -> exponential)
- Network size, Rephil > 1M nodes
- Inference (distributed)
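
One reading of "discrete -> exponential" is the size of a discrete node's conditional probability table, which multiplies with every parent added. The helper below is a small illustration of that arithmetic only; the name `cpt_size` is hypothetical, and the cost of inference in densely connected networks grows for related but separate reasons.

```python
from math import prod


def cpt_size(child_states, parent_states):
    """Number of probabilities in a discrete node's conditional probability table:
    one distribution over the child for every combination of parent states."""
    return child_states * prod(parent_states)
```

A binary node with no parents needs 2 numbers, while the same node with ten binary parents needs `cpt_size(2, [2] * 10)`, i.e. 2048.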

Data
- Algorithm is agnostic to the distributed platform
- We will look at how it can be used with Apache Spark
- .NET, Java and therefore derivatives such as Scala

Apache Spark

Apache Spark
- RDD (Resilient distributed dataset)
- In memory
- DAG execution engine
- Serialization of variables

Apache Spark
- Cache & iterate
- Great for machine learning algorithms, including Bayesian networks
- Scala, Java, Python

Bayes Server distributed architecture
- On each thread on each worker node, Bayes Server simply calculates the sufficient statistics
- This often requires an inference algorithm per thread/partition
- This plays nicely with Bayes Server streaming, without any hacking
- Could be on Hadoop + Spark + YARN, Cassandra, a desktop, or next gen platforms
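
The pattern described here, compute sufficient statistics per partition and then combine them, can be sketched without any cluster. The example below is an assumption-laden illustration rather than Bayes Server code: plain Python lists stand in for worker partitions, and the statistics are those of a single Gaussian (count, sum, sum of squares). With latent variables, each partition would instead run inference to obtain expected statistics.

```python
from functools import reduce


def partition_stats(rows):
    """Sufficient statistics of a Gaussian for one partition: (count, sum, sum of squares)."""
    n, s, ss = 0, 0.0, 0.0
    for x in rows:  # single streaming pass; rows can be any iterator
        n += 1
        s += x
        ss += x * x
    return n, s, ss


def combine(a, b):
    """Statistics from two partitions simply add."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def estimate(stats):
    """Maximum-likelihood mean and variance from the combined statistics."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean


partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # stand-in for worker partitions
total = reduce(combine, map(partition_stats, partitions))
mean, variance = estimate(total)
```

Because the statistics only ever add, the result does not depend on how the data is partitioned. In PySpark the same two functions could plausibly be used as `rdd.mapPartitions(lambda rows: [partition_stats(rows)]).reduce(combine)`, though that wiring is untested here.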

Spark integration
- Moving from Hadoop MapReduce to Spark
- Proof of concept took a single afternoon
  - Due to agnostic approach & streaming
- RDD.mapPartitions
- Spark serialization
  - Use of companion object methods (standard approach)

Example: distributed learning

Example: distributed learning

Distributed time series prediction

Scala
- JVM
- Functional & OO
- Statically typed
- Apache Spark is written in Scala

Spark streaming
- Great for real time anomaly detection
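
The shape of real-time anomaly detection can be shown with a plain Python generator standing in for a stream. This is a deliberately simple sketch, not Spark Streaming code and not the talk's model: it scores each value against a running univariate Gaussian (maintained with Welford's update) and flags values whose log-likelihood falls below a threshold. The `threshold` and `warmup` defaults are arbitrary illustrative choices.

```python
import math


def stream_scores(values, threshold=-10.0, warmup=5):
    """Yield (value, log_likelihood, is_anomaly) using a running Gaussian model."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        if n >= warmup and m2 > 0.0:
            var = m2 / (n - 1)  # sample variance of the values seen so far
            ll = -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
            anomaly = ll < threshold
        else:
            ll, anomaly = None, False  # not enough data to score yet
        yield x, ll, anomaly
        if not anomaly:  # keep anomalies out of the running model
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
```

Feeding it `[1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 9.0, 1.0]` flags only the 9.0. A temporal Bayesian network would replace the running Gaussian with the log-likelihood of the recent window under the learned model.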

GraphX
- Machine learning on table data, queried from Graph

Thank you
- www.bayesserver.com - download, documentation
- www.bayesserver.com/visualization.aspx
- www.bayesserver.com/bayesspark.aspx - Apache Spark source code & examples
- Professional services
  - Training
  - Consultancy
  - Proof of concepts