Big Learning
Data Management and Data Analysis for Industrial Applications
Thomas Natschläger, +43 7236 3343 868, thomas.natschlaeger@scch.at, www.scch.at
SCCH Key Facts
- application-oriented research organization initiated by institutes of the Johannes Kepler University Linz
- cooperation between science and industry
- non-profit organization constituted as a limited company (Ltd)
- owners: Johannes Kepler University Linz; Upper Austrian Research GmbH; Association of Company Partners of SCCH
- about 60 employees (more than 80 including partners)
- income of EUR 5.7 million incl. subsidies in business year 2010/2011
- founded in July 1999 under the K plus Program; COMET competence center since 2008
Research Topics
- Process and Quality Engineering: software engineering; software quality; process and approaches
- Rigorous Methods in Software Engineering: software specification, verification, validation; formal methods (ASM, Event-B, etc.); process modeling, workflows
- Models, Architectures and Tools: software architecture; model-based development; integration of architecture in development
- Knowledge-Based Vision Systems: machine vision; object recognition; object tracking
- Data Analysis Systems: automated and intelligent data analysis; prediction and optimization; knowledge discovery
DAS - Data Analysis Systems
- Application Domains
- Topics: Computational Models; Semantic Knowledge Models; Knowledge Discovery; Machine Learning; Stream Data Analysis; Data Warehousing; Data Management
Overview
- Temporal Analytics on Big Data: Applications; Fault Detection; Proposed Architecture; Related Work
- Learning Big Models: Causal Inference; Enabled by parallelization; Prediction and optimal control
Domain: Industrial Production
[Diagram: system 1, system 2, ..., system i, ..., system n feed into the PIMS]
- Subsystems generate streams of sensor data
- Data is stored in a Production Information Management System (PIMS)
- Analysis tasks: Quality Assurance; Process Optimization; Fault Detection; Fault Diagnosis; ...
Selected References
- voestalpine Stahl GmbH: analysis of the continuous casting process; integration of expert knowledge; visual data mining, interpretation
- Böhler Edelstahl: quality analysis of high-grade steel production
- unisoftware plus: machine learning framework (mlf); basis for many projects in the area of process analysis
- Siemens Transformers Austria: optimization of power transformer cores
- Voith Paper, SCA Laakirchen: analysis and optimization in paper production; analysis tool PaperMiner
- AMS Engineering: knowledge discovery in discrete manufacturing; analysis of standstills, fault detection
Domain: Machine Manufacturer
[Diagram: machines feed into a data center]
- Machines at different locations generate streams of sensor data
- Data is stored in a data center
- Analysis tasks: Usage Monitoring; Profile Analysis; Condition Monitoring; Fault Detection; Fault Diagnosis; ...
Domain: Decentralized Renewable Energy, Home Automation
[Diagram: buildings feed into a data center]
- Sensors of different kinds at each building generate streams of sensor data: temperature; solar radiation; energy production; ...
- Analysis tasks: Usage Monitoring; Profile Analysis; Condition Monitoring; Fault Detection; Fault Diagnosis
Application: Fault Detection for Renewable Energy Units
- (Near) real-time detection of faults of units: a naturally temporal task => Data Stream Processing
- Profile analysis of units: needs access to all units => central application
- Large number of devices => Big Data
- A low false positive rate requires a good model, which needs a considerable amount of historical data, especially for long-term drifts => Big Data
Fault Detection Algorithms
A) Compare measured channels to a model
  - Deviations indicate a fault and its type
  - A good model needs to be identified (learned), typically using historical good data
B) Fit a known model type, e.g. ARX:
  $y_t = \sum_k a_k \, y_{t-k} + \sum_{i,k} b_{i,k} \, x_i(t-k)$
  - A bad coefficient of fitness indicates faults
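Approach B can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the project: it assumes a single input channel, interprets the "coefficient of fitness" as the coefficient of determination R^2, and fits the ARX coefficients by ordinary least squares. The helper names (`fit_arx`, `fitness`, `simulate`) and the simulated fault are hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def arx_rows(y, x, order):
    """Regressors [y(t-1..t-K), x(t-1..t-K)] and targets y(t)."""
    rows, targets = [], []
    for t in range(order, len(y)):
        rows.append([y[t - k] for k in range(1, order + 1)] +
                    [x[t - k] for k in range(1, order + 1)])
        targets.append(y[t])
    return rows, targets

def fit_arx(y, x, order=1):
    """Least-squares estimate of [a_1..a_K, b_1..b_K] via the normal equations."""
    rows, targets = arx_rows(y, x, order)
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear(A, b)

def fitness(y, x, coeffs, order=1):
    """Coefficient of determination R^2 of the one-step-ahead predictions."""
    rows, targets = arx_rows(y, x, order)
    preds = [sum(c * v for c, v in zip(coeffs, r)) for r in rows]
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

def simulate(n, a, b, seed, fault_at=None):
    """Hypothetical test data: ARX(1) process, optionally disturbed from fault_at on."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        value = a * y[t - 1] + b * x[t - 1] + rng.gauss(0, 0.01)
        if fault_at is not None and t >= fault_at:
            value += rng.gauss(0, 1.0)  # injected disturbance standing in for a fault
        y.append(value)
    return y, x
```

On healthy simulated data the fitted model explains almost all of the variance. Once the disturbance is injected the fitness drops clearly, and a monitoring rule could threshold on that drop.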
Evaluated Solution
- Combination of Big Data Storage (BDS) for off-line MapReduce and a Stream Processing Engine (SPE) for on-line, real-time processing
[Diagram: unit 1, unit 2, ..., unit i, ..., unit n -> MUX -> SPE and BDS]
Fault Detection Method A
- Compare measured channels to a model
- MapReduce is used to calibrate the model on historical data
- The SPE applies the model in a user-defined operator (UDO)
- REPLAY for testing
[Diagram: unit 1, ..., unit n -> MUX -> SPE and BDS; BDS -> MapReduce -> Model -> SPE (read e.g. from an RDBMS); BDS -> REPLAY -> SPE]
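The division of labour in Method A can be illustrated with a small, self-contained sketch. It is not tied to any particular MapReduce or SPE product. The per-unit mean and standard deviation model, the 3-sigma rule, and all function names are assumptions chosen for illustration. The batch part follows the map/shuffle/reduce shape, and the streaming part plays the role of the user-defined operator.

```python
from collections import defaultdict
from functools import reduce
import math

def map_record(record):
    """Map: emit (unit, partial statistics) for one historical reading."""
    unit, value = record
    return unit, (1, value, value * value)

def reduce_stats(a, b):
    """Reduce: merge two partial statistics (count, sum, sum of squares)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def calibrate(history):
    """Batch job: per-unit mean and standard deviation from historical good data."""
    grouped = defaultdict(list)
    for unit, stats in map(map_record, history):  # stands in for the shuffle step
        grouped[unit].append(stats)
    model = {}
    for unit, parts in grouped.items():
        n, s, ss = reduce(reduce_stats, parts)
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)
        model[unit] = (mean, math.sqrt(var))
    return model

def detect(stream, model, threshold=3.0):
    """Streaming operator: yield readings that deviate more than threshold sigmas."""
    for unit, value in stream:
        mean, std = model[unit]
        if std > 0 and abs(value - mean) > threshold * std:
            yield unit, value
```

Because `detect` only consumes an iterator of `(unit, value)` pairs, the same operator can be fed either live readings or historical data read back from storage, which is the idea behind REPLAY.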
Fault Detection Method B
- Fit a known model structure to the data
- The BDS supplies historical data for testing via REPLAY
- The SPE incrementally fits a certain kind of regression model
[Diagram: unit 1, ..., unit n -> MUX -> SPE (with Model) and BDS; BDS -> REPLAY -> SPE]
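One way to fit a regression model incrementally inside a stream operator is recursive least squares (RLS). The sketch below is an assumption about what "a certain kind of regression model" could look like, not the method used in the evaluated solution. The class name and its parameters (`forgetting`, `delta`) are hypothetical.

```python
class RecursiveLeastSquares:
    """Linear regression updated one example at a time with constant memory."""

    def __init__(self, n_features, forgetting=1.0, delta=1000.0):
        n = n_features
        self.w = [0.0] * n                      # current coefficient estimate
        # P approximates the inverse of the (weighted) Gram matrix
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]
        self.lam = forgetting                   # < 1.0 discounts old data

    def predict(self, x):
        return sum(w * v for w, v in zip(self.w, x))

    def update(self, x, y):
        """Process one example; return the prediction error made before the update."""
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)
        self.w = [w + g * err for w, g in zip(self.w, gain)]
        # P stays symmetric, so x^T P equals Px transposed
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(n)]
                  for i in range(n)]
        return err
```

Each call to `update` inspects the example once and keeps only the coefficient vector and one small matrix. The returned a-priori error could serve as the fault indicator: a persistently large error suggests that the unit no longer follows the fitted model.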
Stream Data Mining: Incremental Algorithms
1. Process an example at a time, and inspect it only once
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any time
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl. Journal of Machine Learning Research (JMLR) Workshop and Conference Proceedings. Volume 11: Workshop on Applications of Pattern Analysis (2010).
Stream Data Mining: Open Source Framework MOA
- MOA: Massive Online Analysis
- WEKA community, Java
- Big Data stream mining (classification, regression, and clustering) in real time
- Can be easily used with e.g. Hadoop
- Extendable with new mining algorithms
- Goal: provide a benchmark suite for the stream mining community
http://moa.cms.waikato.ac.nz
Discussion
- General setting:
  - Units generate streams of sensor data (time, value)
  - Central storage of data for analysis tasks
  - Many analysis tasks are temporal in nature, e.g. fault detection
- Can be implemented with current technology without much effort
- REPLAY partially solves the problem of implementing algorithms for both MapReduce and SPE
- Issues:
  - Usage of multiple SPEs per machine or combiner
  - Integration of existing incremental learning tools such as MOA
Related Work: TiMR Framework
- Combination of M-R and SPE (DSMS)
- Temporal queries for off-line and on-line processing
- Implemented using StreamInsight and SCOPE/Dryad
Badrish Chandramouli, Jonathan Goldstein, and Songyun Duan. 2012. Temporal Analytics on Big Data for Web Advertising. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering (ICDE '12). IEEE Computer Society, Washington, DC, USA.
Overview
- Temporal Analytics on Big Data: Applications; Failure detection; Proposed Architecture; Related Work
- Learning Big Models: Causal Inference; Enabled by parallelization; Prediction and optimal control
Causal Models for Prediction and Fault Detection
- Setting: complex industrial process; limited knowledge about interdependencies
- Goal: e.g. predict the amount of TOC in wastewater for the next 48 h
- Challenges: robustness of the model; precision of the model; several thousand sensors => computational complexity
- Approach: identify the causal model structure; use parallelization to tackle the computational complexity
Base: Gaussian Graphical Models
- Linear model
- Various methods to estimate the parameters
- Prominent method to estimate the structure: Graphical Lasso (Friedman 2007, 2012), based on L1-regularized minimization of the negative log-likelihood
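For reference, the graphical lasso objective in its standard textbook form is given below. The symbols are not taken from the slide: $S$ denotes the empirical covariance matrix of the sensor channels, $\Theta = \Sigma^{-1}$ the precision (inverse covariance) matrix, and $\lambda \ge 0$ the regularization strength.

```latex
\hat{\Theta}
  = \arg\max_{\Theta \succ 0}
    \; \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda \,\lVert \Theta \rVert_1
```

For Gaussian data, a zero entry $\hat{\Theta}_{ij} = 0$ means that channels $i$ and $j$ are conditionally independent given all other channels, so the non-zero pattern of $\hat{\Theta}$ is the estimated graph structure. A larger $\lambda$ gives a sparser graph.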
Extension to Time: Granger Causality
- X is said to Granger-cause Y if it contains information useful in forecasting Y
- Implemented by graphical lasso on time-lagged variables
- Work in progress:
  - Grouped Granger Graphical Lasso
  - Detection of control loops
  - Non-linear extensions => increased computational complexity
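The textbook linear form of the Granger criterion compares two autoregressive forecasts of $Y$. The notation below is an illustrative assumption (maximum lag $K$, a single candidate cause $X$), not the formulation used in the work in progress.

```latex
\begin{aligned}
\text{restricted:} \quad & y_t = \sum_{k=1}^{K} a_k \, y_{t-k} + \varepsilon_t \\
\text{full:}       \quad & y_t = \sum_{k=1}^{K} a_k \, y_{t-k} + \sum_{k=1}^{K} b_k \, x_{t-k} + \eta_t \\
X \text{ Granger-causes } Y \quad & \text{if} \quad \operatorname{Var}(\eta_t) < \operatorname{Var}(\varepsilon_t),
  \ \text{i.e. some } b_k \neq 0
\end{aligned}
```

With sparse (L1-regularized) estimation on the lagged variables, the non-zero coefficients on the lags of $X$ indicate a directed edge from $X$ to $Y$. With several thousand sensors and $K$ lags each, the number of candidate coefficients grows quickly, which is where the computational complexity comes from.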
Parallelization of Machine Learning Algorithms
- MapReduce (see first part of talk)
  - Good for data-parallel tasks; problematic for iterative algorithms and complex dependencies in the data
- GraphLab (graphlab.org)
  - Intuitively expresses computational dependencies
  - Applied to dependent records, which are stored as vertices in a large distributed data graph
- GPGPU
  - Complex low-level code (kernels), or:
  - High-level languages: SAC, Matlab, Mathematica, ...
  - Meta-programming: PyCUDA / CL, ...
- Hardware-agnostic parallel patterns
  - Especially parallel patterns for machine learning
ParaPhrase
- High-level design and implementation patterns
  - Useful parallelism for a wide range of parallel applications
  - Heterogeneous multicore/manycore systems
- Hardware abstraction
  - Basis: FastFlow framework (Turin, Pisa)
- General-purpose patterns
  - Master-slave, farm, pipeline, work queue, data dependency
- Domain-specific patterns (SCCH, HLR Stuttgart)
  - Suitability of generic patterns for machine learning
  - ML patterns: pool-oriented, graphical model patterns, time series, ...
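The farm pattern named above can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch of the pattern's shape and is unrelated to the FastFlow or ParaPhrase APIs. The machine learning task (choosing a ridge penalty for a one-dimensional regression on a validation set) and the names `farm` and `make_ridge_worker` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def farm(worker, tasks, n_workers=4):
    """Farm pattern: hand independent tasks to a pool of identical workers
    and collect the results in task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, tasks))

def make_ridge_worker(train, valid):
    """Build a worker that fits 1-D ridge regression for one penalty value
    and returns (penalty, mean squared validation error)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, y in train)

    def worker(lam):
        w = sxy / (sxx + lam)                       # closed-form 1-D ridge solution
        err = sum((y - w * x) ** 2 for x, y in valid) / len(valid)
        return lam, err

    return worker

if __name__ == "__main__":
    train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    valid = [(4.0, 8.0)]
    results = farm(make_ridge_worker(train, valid), [0.0, 1.0, 10.0])
    best = min(results, key=lambda r: r[1])
    print("best penalty:", best[0])
```

Model selection fits the farm pattern because the candidate models are independent of each other. In CPython, threads do not run CPU-bound Python code in parallel, so a process pool or a native backend would be the more realistic choice for heavy workloads. The pattern itself stays the same.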
Relevant Use Cases / Project Competencies (selection)
- TRUMPF Austria: improving the precision of bending machines
- K-Projekt SoftNet (I + II): fault prediction in software systems; mining repositories
- K-Projekt PAC (Process Analytical Chemistry): virtual sensors for chemical process analysis and control
- BlueSky: locally optimized weather predictions; application: energy efficiency
- Verbund: prediction of available water flow to optimize renewable energy usage
- Based on the machine learning framework
Use Case: Local Weather Prediction
- Data collection, data sources:
  - Global weather models
  - Local sensors: weather stations, power plants, ...
  - Topography, expert knowledge
- Analysis: expert knowledge, models
- Prediction
- Goal:
  - Planning of events, maintenance, ...
  - Basis for optimization of energy usage
[Figure: map of Austria (Bregenz, Innsbruck, Salzburg, Linz, St. Pölten, Wien, Eisenstadt, Graz, Klagenfurt) with example analysis plots]
Optimization of Renewable Energy Usage
- Flow values, precipitation / temperature and forecast
- Snow melt, ground humidity (Holzmann & Nachtnebel 2002)
- Data-driven models (e.g. Ridge Regression, Neural Networks)
- Rainfall-runoff model (Hebenstreit 2000)
- HYSIM II (Drabek et al. 2002)
- SAMBA: optimal weighting of all models
- Goals, short term: inclusion of the availability of renewable energy (water, wind, solar) in energy planning and trading
[Figure: map of hydropower plants in Austria along the rivers Inn, Salzach, Drau, Enns, Mur and Donau; legend: run-of-river plants of AHP, storage plants of AHP, jointly operated plants of AHP, Verbund holdings]
Summary
- Temporal Analytics on Big Data: Applications; Failure detection; Proposed Architecture; Related Work (MOA, TiMR)
- Learning Big Models: Causal Inference; Enabled by parallelization; Prediction and optimal control
- Use Cases
Event tip!
Sustainable software quality with a suitable strategy: TRUST-IT
18 April, 09:00-14:00, Österreichische Computergesellschaft, Vienna
Target audience: software development managers, process owners, project managers, software quality engineers and architecture leads.
www.scch.at/de/trust-it-wien-programm
Contact
- DI Michael Zwick, +43 7236 3343 843, michael.zwick@scch.at
- Dr. Thomas Natschläger, +43 7236 3343 868, thomas.natschlaeger@scch.at
- Dr. Holger Schöner, +43 7236 3343 816, holger.schoener@scch.at
www.scch.at