Data Analytics at NICTA Stephen Hardy National ICT Australia (NICTA) shardy@nicta.com.au NICTA Copyright 2013
Outline
  Big data = science!
  Data analytics at NICTA
    Discrete
    Finite
    Infinite
  Machine Learning for the natural sciences
Data, Data, Everywhere
Evolution vs. Revolution
  [Diagram: Statistics, Machine Learning and Computer Science connected to Personal, Enterprise, Government, Societal Challenges and Scientific Challenges by arrows labelled "problems" and "techniques"]
  Analysis of data to prove or disprove hypotheses = science!!
Not just the data
  Data Scale: Volume, Velocity, Variety
  Infrastructure: Analytics Engines, SQL / NoSQL, Distributed computation, File systems
  Algorithmic complexity: Machine learning toolkits, Graphical models, Graph learning, Deep learning, Random forests, Nonparametric statistics
  Big Data, Data Analytics, Big Analytics
What is NICTA?
  Australia's National Centre of Excellence in Information and Communication Technology
  700 Staff, 5 labs, $100m/y revenue
  NICTA objectives
    Research Excellence in ICT
    Wealth Creation for Australia
  Transforming Industry: $3bn/y direct impact on GDP from projects
  New Industries: Eleven spin-outs, working with ICT SMEs
  Skills and Capacity: 17 University partners, 280 PhD Students
Data Analytics: A summary
  Discrete (ℵ, $P(n_i)$): Events, People
  Finite ($\mathbb{R}^n$, $P(x_i)$): Signals, Location
  Infinite (I, $P(f_i)$): Spatial Fields, Temporal Fields
NICTA Data Analytics (1)
  Discrete (ℵ, $P(n_i)$): Events, People, Text, Gene Sequences
  Scoobi data mining / Active learning
  Energy constrained machine learning
  Edge-distributed learning
  Offer targeting, Risk Estimation, Behaviour prediction
  Biomedical texts, Opinion Watch, Event Watch
  Machine learning for Natural Language Processing
  Patent analysis, Biomedical informatics, Sentiment analysis
  Xenome, GWIS
  Efficient compressed storage and search for sequence data
  Bioinformatics
Event watch
  Demo: http://pmo-eventwatch.research.nicta.com.au/demo/
  Sentiment Analysis
  40,000-word lexicon
  Part of Speech
  Sentiment
  Key phrase extractor
  Named Entity Recognition
  LDA: Latent Dirichlet Allocation
  Differential topic modeling
  Supervised LDA
Key technology - Topic modeling
  [Diagram: Documents 1-5 over a shared vocabulary; each document has a probability distribution over Topics A-D, and each topic has a probability distribution over the vocabulary]
  Documents consist of words
  Documents are modeled as a mixture of topics
  Words are associated with topics
  Latent Dirichlet Allocation learns the distributions and allocates every word in each document to a topic
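Speaker note: the topic modeling idea on this slide can be illustrated with a small collapsed Gibbs sampler, one common way to fit Latent Dirichlet Allocation. This is a minimal sketch only; the slides do not say which inference method NICTA uses. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA on tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words in doc d given topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is given topic k
    topic_total = [0] * num_topics                       # words given topic k overall
    assign = []                                          # topic of every word position

    def move(d, w, k, delta):
        # Add or remove one word-to-topic allocation from all count tables.
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random allocation of every word to a topic.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            move(d, w, k, +1)

    # Repeatedly resample each word's topic given all other allocations.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                move(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                assign[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                move(d, w, assign[d][i], +1)

    # Smoothed per-document topic mixture.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return assign, theta


# Hypothetical toy corpus, not NICTA data.
docs = [
    "pipe water main break pipe".split(),
    "water pipe soil break".split(),
    "solar cloud energy solar grid".split(),
    "energy grid solar cloud".split(),
]
assign, theta = lda_gibbs(docs, num_topics=2)
for doc, topics, mix in zip(docs, assign, theta):
    print(list(zip(doc, topics)), [round(p, 2) for p in mix])
```

Each printed row pairs every word with its allocated topic and shows the document's topic mixture, matching the slide's description of documents as mixtures of topics.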
NICTA Data Analytics (2)
  Finite ($\mathbb{R}^n$, $P(x_i)$): Signals, Location, Genetics
  SparSNP: Efficient distributed sparse regression method
  Disease expression
  Critical Water Mains
  Non-parametric Bayesian methods
  Preventative Maintenance
  Structural Health Monitoring
  Distributed, autonomous, real-time data with classification / clustering
  Fault Prediction
  Service optimisation
  Smart Grid
Machine Learning Process
  Existing data (complex data mix): Cond. Assessment, Age, Type, Material, Size, Length, Failures, Soil, Pressure, Location, Weather, and many more
  NICTA's analysis: Hierarchical Beta Process
  Risk / age, Risk / type, Risk / size, Age profile
  Accurate
  Improved prediction
  Data-driven prediction from multiple existing data sources
  Dynamic model update and aggregation
Improvement on failure prediction
  Use 1998-2008 break records for model building
  Use 2009-2011 break records for testing
  Multiple factors: laid year, material, size, coating, and soil
  [Chart: Wollongong; failures detected against length of condition assessment, comparing NICTA and Weibull, with a zoomed-in view (2.5%)]
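Speaker note: the evaluation on this slide (build on 1998-2008 breaks, test on 2009-2011 breaks, then compare failures detected for a given length of condition assessment) can be sketched as below. The pipe IDs, lengths, break records and the naive "past breaks per metre" risk score are hypothetical stand-ins; the score is not the NICTA model or the Weibull baseline.

```python
from collections import Counter


def count_breaks(break_records, first_year, last_year):
    """Count breaks per pipe within an inclusive year range."""
    return Counter(pipe for pipe, year in break_records
                   if first_year <= year <= last_year)


def detection_rate(lengths, risk, test_breaks, budget_fraction):
    """Share of test-period breaks on the pipes inspected.

    Pipes are inspected from highest to lowest risk, stopping at the first
    pipe that would exceed the given fraction of total network length.
    """
    budget = budget_fraction * sum(lengths.values())
    inspected, found = 0.0, 0
    for pipe in sorted(lengths, key=lambda p: risk[p], reverse=True):
        if inspected + lengths[pipe] > budget:
            break
        inspected += lengths[pipe]
        found += test_breaks[pipe]
    total = sum(test_breaks.values())
    return found / total if total else 0.0


# Hypothetical network: pipe id -> length in metres.
lengths = {"A": 100.0, "B": 400.0, "C": 300.0, "D": 200.0}
# Hypothetical break records: (pipe id, year).
break_records = [("A", 2001), ("A", 2005), ("A", 2010), ("B", 2003),
                 ("B", 2011), ("C", 2009), ("D", 1999)]

train = count_breaks(break_records, 1998, 2008)  # model-building period
test = count_breaks(break_records, 2009, 2011)   # held-out testing period

# Stand-in risk score: training-period breaks per metre of pipe.
risk = {pipe: train[pipe] / lengths[pipe] for pipe in lengths}

for fraction in (0.5, 0.75, 1.0):
    print(fraction, detection_rate(lengths, risk, test, fraction))
```

Plotting the detection rate against the inspected fraction for two competing risk scores gives a curve of the kind the chart compares.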
Risk Map
  Risk ranking of pipes based on likelihood of failure
  Legend (Red = highest, Blue = lowest):
    Top 10% pipes
    10% ~ 40% pipes
    40% ~ 60% pipes
    Last 40% pipes
  Actual breaks in the following year
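Speaker note: the banding used in the risk map legend can be sketched as follows. Pipes are ranked by a risk score, split into the four bands named on the slide, and the following year's actual breaks are tallied per band. The risk scores, pipe names and break list are hypothetical, and banding by pipe count (rather than by length) is an assumption.

```python
from collections import Counter

# Upper bound of each band as a percentage of the ranked pipes.
BANDS = [(10, "top 10%"), (40, "10%-40%"), (60, "40%-60%"), (100, "last 40%")]


def risk_bands(risk):
    """Map each pipe to a band by its position in the risk ranking."""
    ranked = sorted(risk, key=risk.get, reverse=True)
    bands = {}
    for rank, pipe in enumerate(ranked):
        # Integer comparison avoids floating-point edge cases at band limits.
        bands[pipe] = next(label for upper, label in BANDS
                           if rank * 100 < upper * len(ranked))
    return bands


def breaks_per_band(bands, next_year_breaks):
    """Count the following year's actual breaks that fall in each band."""
    counts = Counter()
    for pipe in next_year_breaks:
        counts[bands[pipe]] += 1
    return counts


# Hypothetical scores: pipe0 is the riskiest, pipe9 the least risky.
risk = {f"pipe{i}": 1.0 - i / 10 for i in range(10)}
bands = risk_bands(risk)
print(bands)
print(breaks_per_band(bands, ["pipe0", "pipe2", "pipe7"]))
```

A ranking is doing its job when most of the following year's breaks land in the top bands.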
NICTA Data Analytics (3)
  Infinite (I, $P(f_i)$): Spatial Fields, Temporal Fields
  Renewable Energy, Geothermal, Groundwater, Soils, Air quality, Solar!
  Data Fusion with uncertainty estimation
  Non-parametric Bayesian methods
  Resource exploration
  Resource management
  Transparent Machine Learning
  Resource discovery, Plant system diversity, Non-linear laser physics
  Big Data Knowledge Discovery

Solar Energy Forecast Software
  Did you know failure to predict solar energy production will mean we won't fully capture available solar resources?
  The Problem
    Electricity grids around the world were not designed to manage large fluctuations of supply in power generation. Traditional forms of power supply such as coal-fired stations provide a stable, non-fluctuating form of power supply. However, the energy we receive from the sun is much more unpredictable and grids are not designed to cope with the dynamic nature of renewable energy production.
    Current prediction methods are not accurate enough at the suburb level and not fine-grained enough (i.e. currently a matter of days, not minutes). Current methods also require expensive (up to $75,000) and obtrusive equipment in a large area to collect the required data.
  Impact
    NICTA aims to lower the costs of solar monitoring systems to allow for fast, affordable forecast systems to be installed all over Australia. Specifically, we aim to:
      Develop low-cost devices ($500) that measure current levels of rooftop solar power production by monitoring 150 households across the ACT.
      Utilise low-cost sky cameras ($250) to detect cloud cover. From these images, NICTA's researchers will project the motion of the clouds and estimate the 'darkness' of their shadows, thereby predicting their inhibitive effect on power output.
      Develop software that will predict solar energy production by suburb within minutes and hours rather than days.
  Technical Contact: Nicholas.Engerer@nicta.com.au
  Business Contact: Jodi.Steel@nicta.com.au
  Collaborators: [logos not recovered]
  The Solar Energy Forecast Software project is part of NICTA's Security and Environment Business Team, providing security for people, resources and critical systems.
Engineered Geothermal Systems
Geophysical Data
  Gravity, Magnetics, Core Samples, Temperature, Reflection Seismic, Magnetotellurics, Gravity Gradiometry, Down-hole Geophysics, Stress, Porosity, Passive Seismic, Micro Seismic...
Distributions of geologies
  Magnetotellurics, Seismic, Magnetism, Gravity
  Probability Distribution
Results fusing gravity & boreholes
  Predicted mean density and uncertainty
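Speaker note: the idea of fusing two data sources into a predicted mean and an uncertainty can be illustrated with the simplest possible case, combining two independent Gaussian estimates of the same density by precision weighting. This is a deliberately simplified stand-in, not the non-parametric Bayesian method behind the results shown; the function name `fuse_gaussian` and all numbers are hypothetical.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Combine two independent Gaussian estimates of one quantity.

    Each estimate is weighted by its precision (1 / variance), so the more
    certain source pulls the fused mean harder and the fused variance is
    smaller than either input variance.
    """
    precision = 1.0 / var_a + 1.0 / var_b
    mean = (mean_a / var_a + mean_b / var_b) / precision
    return mean, 1.0 / precision


# Hypothetical rock density estimates at one location, in kg/m^3.
gravity_mean, gravity_var = 2600.0, 100.0 ** 2    # broad estimate from gravity
borehole_mean, borehole_var = 2700.0, 50.0 ** 2   # tighter estimate from a borehole

mean, var = fuse_gaussian(gravity_mean, gravity_var, borehole_mean, borehole_var)
print(round(mean, 1), round(var ** 0.5, 1))
```

The fused estimate sits closer to the borehole value because that source carries the smaller variance.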
Reuse
  [Diagram: Statistics, Machine Learning and Computer Science connected to Personal, Enterprise, Government, Societal Challenges and Scientific Challenges by arrows labelled "problems" and "techniques"]
  How can we apply new techniques of machine learning / analytics to science?
Machine Learning in the Natural Sciences
  Big Data Knowledge Discovery
  Science and Industry Endowment Fund (www.sief.org) project
  Collaboration between NICTA (machine learning), SIRCA (big data), Sydney Uni (plate tectonics) and Macquarie Uni (forest ecosystems, non-linear laser physics)
  How do we make machine learning easier to use in the natural sciences?
The End