A SMART ELEPHANT FOR A SMART-GRID: (ELECTRICAL) TIME-SERIES STORAGE AND ANALYTICS. EDF R&D SIGMA Project Marie-Luce Picard




A SMART ELEPHANT FOR A SMART-GRID: (ELECTRICAL) TIME-SERIES STORAGE AND ANALYTICS WITHIN HADOOP
EDF R&D, SIGMA Project
Marie-Luce Picard
Forum TERATEC, June 26th 2013

OUTLINE
1. CONTEXT
2. A PROOF OF CONCEPT USING HADOOP
3. ADVANCED ANALYTICS
4. CONCLUSION AND PERSPECTIVES

1. CONTEXT

SMART GRIDS, SMART METERS, SMART DATA
- Smart-grid projects are spreading, motivated by economic or regulatory constraints or by environmental needs.
- New perspectives are emerging for energy management as electric vehicles and renewable energy generation expand.
- A large number of smart meters and sensors will be deployed, producing a data deluge that utilities will have to face.

SMART METERING: A DATA DELUGE!
- French project: 35+ million smart meters, billions of metering data points
- Currently: pilot program with 300k smart meters deployed

SMART METERING DATA: WHAT DOES A LOAD CURVE LOOK LIKE?
[Figures: individual load curves]
- Left: same customer, two different days
- Top: same day, two different customers

MASSIVE DATA MANAGEMENT IN THE ENERGY DOMAIN
Challenges:
- More complexity in the electric power system (distributed generation, demand response, ...)
- Fast-growing complexity of customer indoor equipment (smart meters and devices, Internet of Things, ...)
- Core business will involve more IT and data management

2. A PROOF OF CONCEPT USING HADOOP

MASSIVE TIME-SERIES STORAGE
Objective: proof of concept for running a large number of queries (with varying complexity, scope, frequency and acceptable latency) on a huge number of load curves
Data:
- Individual load curves, weather data, contractual information, network data
- 1 measurement every 10 min for 35 million customers
- Annual data volume: 1800 billion records, 120 TB of raw data
Operational data warehouse able to:
- Supply a large volume of data
- Ingest new incoming data (pre-processing)
- Allow concurrent queries: tactical queries, analytical queries, ad hoc queries
Evaluation criteria: concurrency and performance (QoS, SLA), convergence and agility
Candidate solutions: traditional VLDB as well as Hadoop
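The annual volume quoted on this slide can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and decimal terabytes (1 TB = 10^12 bytes); the function name is illustrative and does not come from the project.

```python
MINUTES_PER_DAY = 24 * 60


def annual_records(customers, interval_minutes, days=365):
    """Number of measurements per year for a fleet of meters."""
    points_per_day = MINUTES_PER_DAY // interval_minutes  # 144 at a 10 min step
    return customers * points_per_day * days


records = annual_records(customers=35_000_000, interval_minutes=10)

# Average raw size of one record, assuming 120 TB means 120e12 bytes.
bytes_per_record = 120e12 / records

print(f"{records:,} records per year")
print(f"about {bytes_per_record:.0f} bytes per raw record")
```

The result, roughly 1.84 x 10^12 records, is consistent with the "1800 billion records" figure, and 120 TB then corresponds to about 65 bytes per raw record.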

USING HADOOP FOR STORING MASSIVE TIME-SERIES
Hadoop:
- Native distributed file system (HDFS)
- Distributed processing using the MapReduce paradigm
- A large and fast-evolving ecosystem
- Use cases: mainly on unstructured data
- Divide and conquer
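As an illustration of the MapReduce paradigm applied to load curves, here is a minimal single-process sketch that sums the power of all meters per timestamp. It only mimics the map, shuffle and reduce phases in memory; the record layout (meter id, timestamp, power) and all function names are assumptions made for the example, not the project's code.

```python
from collections import defaultdict


def map_reading(record):
    """Map phase: emit (timestamp, power) for one meter reading."""
    meter_id, timestamp, power = record
    yield timestamp, power


def reduce_sum(timestamp, powers):
    """Reduce phase: total power of all meters at one timestamp."""
    return timestamp, sum(powers)


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


readings = [
    ("m1", "00:00", 1.0),
    ("m2", "00:00", 2.5),
    ("m1", "00:10", 0.5),
    ("m2", "00:10", 1.5),
]
aggregated = run_mapreduce(readings, map_reading, reduce_sum)
print(aggregated)
```

On a real cluster the map and reduce functions run in parallel on the nodes holding the data, which is where the divide-and-conquer benefit comes from.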

POC HADOOP: THE DATA MODEL
- Compressed data volume on HDFS: 10 TB (x3)

DATA GENERATOR: COURBOGEN
- Courbogen can generate massive datasets (not statistically very realistic)
- Generates load curves and associated data
- Customizable tool: interval, duration, data quality, noise on the curves
- Distributed architecture (NodeJS, Redis)
- Output as a data stream
- One week of data for 35M curves

DESIGN
- R&D cluster: 20 nodes, 336 cores
- All data stored in HDFS
- Hive for analytical queries (11 months of data)
- HBase at the forefront of data access, for tactical queries (3 months of data)

POC HADOOP: DATA UPLOAD
Daily data:
- Raw data: 327 GB; HDFS data: 50 GB; Hive/HBase: 25-28 GB
- Initial upload: 15 minutes

Hadoop component | Representation model | Total volume of data (compressed) | Time for uploading and preparing the data
Hive | Tuple mode | 10 x 3 = 30 TB | 2 h 30, 35 for uploading
HBase | Array (daily data) | 3 x 3 = 9 TB | 8 minutes (upload only; aggregated data is computed in Hive)
TOTAL | | 39 TB | 2 h 38

POC HADOOP: RESULTS
- Overall GOOD results; for some types of queries, competitive with traditional approaches
- Tactical queries are successfully handled by HBase, offering low latencies under a high concurrent load
- Analytical queries are processed by Hive

Analytical queries on Hive
- 11 months of data, 35 million curves, 10-minute interval
- Partitioning: day/tariff/power
- Tuple representation

Query | Execution time (1 day) | Execution time (7 days)
Global France aggregated curve (10 min) | 1 min 56 s | 19 min 24 s
Aggregated curve by tariff and power | 2 min 21 s | 24 min 33 s
Average consumption by building type (ad hoc) | 1 h 18 s | 3 h 16 min 50 s

POC HADOOP: RESULTS
Tactical queries using HBase
- 3 months of data, 35 million curves, 10-minute interval
- 500 concurrent queries, SLAs required to be met for 90% of queries
- Array representation

Duration | Time step | Concurrent queries requested | Concurrent queries achieved | SLA requested per query (s) | SLA achieved per query (s) | Success rate achieved (%) | Standard deviation (s) | Queries executed per second
Day | 10 min | 200 | 200 | < 1 | 0.57 | 86.53 | 1.77 | 210.98
Day | 30 min | 50 | 50 | < 3 | 0.68 | 93.54 | 1.98 | 73.83
Week | 10 min | 160 | 160 | < 2 | 2.11 | 88.36 | 9.16 | 45.76
Week | hourly | 40 | 40 | < 7 | 2.89 | 92.24 | 11.43 | 13.83
Month | hourly | 32 | 32 | < 7 | 2.45 | 94.09 | 11.06 | 13.09
Month | daily | 8 | 8 | < 10 | 1.68 | 94.67 | 7.35 | 4.79
Quarter | daily | 8 | 8 | < 10 | 2.13 | 94.33 | 9.12 | 3.77
Quarter | weekly | 2 | 2 | < 15 | 2.13 | 94.92 | 8.98 | 0.95

POC HADOOP: TIME SERIES REPRESENTATION MODELS
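The deck names two representations, a tuple mode used in Hive and a daily array used in HBase, without showing their layout in this transcript. The sketch below is one plausible reading: a tuple is one row per measurement, and an array packs all the measurements of one meter for one day into a single value. The field names and the (meter, day) key are assumptions for illustration only.

```python
from collections import defaultdict


def tuples_to_daily_arrays(rows, points_per_day=144):
    """Pack (meter_id, day, slot, power) tuples into one array per meter and day.

    slot is the index of the measurement within the day (0..points_per_day-1);
    missing measurements stay as None.
    """
    arrays = defaultdict(lambda: [None] * points_per_day)
    for meter_id, day, slot, power in rows:
        arrays[(meter_id, day)][slot] = power
    return dict(arrays)


# Tiny example with 3 points per day instead of 144.
rows = [
    ("m1", "2013-06-26", 0, 1.0),
    ("m1", "2013-06-26", 2, 3.0),
    ("m2", "2013-06-26", 1, 2.0),
]
daily = tuples_to_daily_arrays(rows, points_per_day=3)
print(daily)
```

Under this reading, the tuple form suits scans and aggregations across many meters, while the array form lets a key-value lookup return a whole day of one curve in a single read.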

POC HADOOP: VISUALIZATION
Hadoop / Tableau Software: synchronous load curves by RE

3. ADVANCED ANALYTICS

ADVANCED ANALYTICS WITH HADOOP
Strong needs for advanced analytics on smart metering data:
- Segmentation based on load curves
- Forecasting on local areas
- Scoring for non-technical losses
- Pattern recognition within load curves
- Predictive modeling
Options considered:
- Implementing classical data-mining tasks vs. home-made methods
- Using existing libraries and toolkits vs. developing a time-series toolkit on Hadoop

EXISTING ANALYTICS TOOLKITS
RHadoop
- 3 R packages for analyzing data stored within Hadoop from R (rmr2, rhdfs, rhbase)
- Methods have to be re-implemented using the rmr2 package
- Small user community
- Linked to Revolution Analytics
Mahout
- A Java toolkit with data analysis methods implemented using the MapReduce paradigm
- Apache Foundation (http://mahout.apache.org/)
- Large user community
- Evolving (and growing) very fast

POC HADOOP: DATA CLUSTERING
Examples of tests run on a data set of 35 million curves, one measurement per curve (daily average consumption)

RHadoop script (execution time: 3.12 hours)

    # Load the CSV file from HDFS and spread the curves over 35000 keys
    > tb_input_kmeans = mapreduce(
        '/tmp/tb_kmeans_r.csv',
        input.format = make.input.format('csv', sep = ','),
        structured = T,
        vectorized = T,
        map = function(k, v) keyval(v$V1 %% 35000, v, vectorized = T),
        reduce = function(k, vv) keyval(k, vv, vectorized = F),
        backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=200",
                                                D = "mapred.map.tasks=200")),
        verbose = T)
    # 20 clusters, 15 iterations
    > kmeans(tb_input_kmeans, ncenters = 20, iterations = 15, fast = T)

Mahout script (execution time: 17 minutes)

    # Convert the Hive table files into Mahout vectors
    $ mahout org.apache.mahout.clustering.conversion.InputDriver \
        --input /user/hive/warehouse/sigma.db/tb_input_kmeans \
        --output /user/sigma/outputvector/
    # 20 clusters, Euclidean distance, at most 15 iterations
    $ mahout kmeans --input /user/sigma/outputvector/ -c clusters -k 20 \
        --output /user/sigma/output \
        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
        --maxIter 15 --overwrite --clustering

FROST: A LIBRARY FOR TIME SERIES TRANSFORMATION WITHIN HIVE
- Analytics and data mining on time series: curse of dimensionality
- FROST (First Release Of time Series Toolkit, FROST.JAR) is a set of user-defined functions in Hive for simplifying time series
- Example:
    ADD JAR FROST.JAR;
    SELECT ID, SAX(POWER, 8, 3) FROM BIG.DATA GROUP BY DAY;
- Other functions in FROST:
  - PAA: Piecewise Aggregate Approximation
  - DFT: Discrete Fourier Transform
  - DWT: Discrete Wavelet Transform
  - and other "home-made" methods
- SAX principle (Symbolic Aggregate approXimation)
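To make the PAA and SAX principle concrete, here is a minimal sketch of the textbook technique in plain Python. It is not FROST's implementation: the slides do not explain the arguments of `SAX(POWER, 8, 3)`, so reading them as word length and alphabet size is an assumption, as are the function names below. The series is z-normalized, reduced to segment means (PAA), and each mean is mapped to a letter using breakpoints that split the standard normal distribution into equiprobable regions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    size = len(series) // n_segments
    return [mean(series[i:i + size]) for i in range(0, len(series), size)]


def sax(series, n_segments, alphabet_size):
    """Symbolic Aggregate approXimation of a series as a short word."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]
    # breakpoints cut N(0, 1) into alphabet_size equiprobable regions
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[bisect_right(breakpoints, v)]
                   for v in paa(z, n_segments))


# A step-shaped curve: low consumption, then high consumption.
curve = [0] * 8 + [10] * 8
word = sax(curve, n_segments=4, alphabet_size=3)
print(word)
```

A curve of 144 daily points reduced to an 8-letter word over 3 symbols is far cheaper to store, compare and cluster, which is how such transforms address the curse of dimensionality.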

SEARCHING TIME SERIES WITH HADOOP
- Objective: searching for similar time series within a huge set of series
- Top-K or range queries based on a similarity measure
- Jumping or sliding windows
- Brute force of distributed computations in the Hadoop environment: use of Hive UDFs

SEARCHING TIME SERIES WITH HADOOP
- Data: 35 million curves, 30 days
- Top-5 (or top-500), with UDAF functions on jumping windows: ~4 min 45 s
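The brute-force search can be sketched on a single machine as follows. The slides do not name the similarity measure, so Euclidean distance is an assumption here, as are the function names; in the proof of concept this logic runs as Hive UDF/UDAF functions distributed over the cluster. A jumping window advances by its own width, while a sliding window advances by one point.

```python
import heapq
import math


def windows(series, width, step):
    """Yield (start, window) pairs; step == width gives jumping windows."""
    for start in range(0, len(series) - width + 1, step):
        yield start, series[start:start + width]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_similar(curves, query, k, step=None):
    """Return the k closest windows as (distance, curve_id, start) tuples."""
    width = len(query)
    step = width if step is None else step  # jumping windows by default
    candidates = (
        (euclidean(window, query), curve_id, start)
        for curve_id, series in curves.items()
        for start, window in windows(series, width, step)
    )
    return heapq.nsmallest(k, candidates)


curves = {"A": [0, 0, 1, 2, 3, 0], "B": [1, 2, 3, 9, 9, 9]}
query = [1, 2, 3]
jumping = top_k_similar(curves, query, k=2)
sliding = top_k_similar(curves, query, k=2, step=1)
print(jumping)
print(sliding)
```

In this toy example the sliding search finds the exact match inside curve A that the jumping search skips, at the cost of examining more windows; that trade-off is why the window type is a parameter of the search.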

4. CONCLUSION AND PERSPECTIVES

POC HADOOP: CONCLUSIONS
- Hadoop as an alternative approach for storing and analyzing massive time-series data (good results, some of them competitive for querying; ability to implement advanced analytics)
- Is Hadoop enterprise-ready?
  - (+): cost, commodity hardware, flexibility, scalability, structured and unstructured data, interoperability
  - (-): skills (high complexity), maturity (security, operations, SQL compliance)
- Recommendation: should be limited to specific (non-critical) usages such as Big ETL, unstructured data, BI subset, Big sandbox and analytical warehouse

POC HADOOP: PERSPECTIVES
Hadoop for unstructured data
- Text-mining applications
- Using the Mahout library (more efficient, updatable) to implement text-mining functions (Lucene)
- Needs tuning (format, stop words, ...) and integration (text + clustering, results visualization)
- Currently tested on tweets
Hadoop and HPC
- Impact on the location of data (Hadoop recommendation: data nodes should be close to the processing nodes)
- Installation of the Hadoop ecosystem on an HPC cluster
- Impact on performance (read/write)
- NetCDF files (coming from simulation codes) ingested by Hadoop: SciHadoop, a Hadoop plug-in for NetCDF data sets
- Work in progress

REFERENCES
- A proof of concept with Hadoop: storage and analytics of electrical time-series. Marie-Luce Picard, Bruno Jacquin. Hadoop Summit 2012, California, USA, 2012.
  Presentation: http://www.slideshare.net/hadoop_summit/proof-of-concent-with-hadoop
  Video: http://www.youtube.com/watch?v=mjzblmbvt3q&feature=plcp
- Massive Smart Meter Data Storage and Processing on top of Hadoop. Leeley D. P. dos Santos, Alzennyr G. da Silva, Bruno Jacquin, Marie-Luce Picard, David Worms, Charles Bernard. Workshop Big Data 2012, VLDB (Very Large Data Bases) Conference, Istanbul, Turkey, 2012.
  http://www.cse.buffalo.edu/faculty/tkosar/bigdata2012/program.php
- Smart Metering x Hadoop x Frost: A Smart Elephant Enabling Massive Time Series Analysis. Benoît Grossin, Marie-Luce Picard. Hadoop Summit Europe 2013, Amsterdam, March 2013.
  http://hadoopsummit.org/amsterdam/

Acknowledgments
Work achieved with: Alice Bérard, Charles Bernard, Leeley Daio-Pires-Dos-Santos, Alzennyr Gomes Da Silva, Benoît Grossin, Georges Hébrail, Bruno Jacquin, Jiannan Liu, Vincent Nicolas, David Worms