Time-Series Databases and Machine Learning

Time-Series Databases and Machine Learning. Jimmy Bates, November 2017

Top-Ranked Hadoop Distribution: Read Write File System; World Record Performance; High Availability; Enterprise-grade Security; Ease of Data Ingestion; Complete Data Protection; Real Multi-tenancy; Unbiased Open Source.

Top-Ranked Hadoop Distribution. Top-Ranked NoSQL.

Top-Ranked NoSQL: Real-Time Operational Analytics; Performance; Security; Unified Data Access; Scalability and Reliability; Architectural and Administrative Simplicity.

Top-Ranked Hadoop Distribution. Top-Ranked NoSQL. Top-Ranked SQL-on-Hadoop Solution.

Top-Ranked SQL-on-Hadoop Solution: the most SQL options on Hadoop, including our work with Apache Drill.

Agenda: 1. What is Time Series; 2. A Basic Workflow; 3. The Data Science User; 4. A Quick Look at OpenTSDB; 5. Resources for Getting Started.

Time Series Data. What is time series data? Any set of data points that have a time associated with them. Examples of time series data include weather, web clickstreams, product sales, stock trades, machine logs, fleets, sensors, devices, network traffic, and system logins.

Value of Time Series Analytics: forecasting; exec dashboards; seasonality; correlation. Use cases cut across different industries: trend analysis; IoT sensor data; operational behavioral analysis; preventative maintenance; environmental monitoring (climate); cell tower data analysis; stock trade data analysis.

Typical Time Series Analytics Flow. Stages: data sources, data store, extract, prep, aggregate, reports (batch ETL). Environment: IT teams, application development groups, analytics groups. Persona: enterprise architect, NoSQL developer, database developer, analyst. Task: conduct analyses or provide reports to management.
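The extract, prep, and aggregate stages of this batch flow can be sketched in a few lines of Python. This is only an illustration, not the pipeline from the talk: the record layout `(series, epoch_seconds, value)`, the hourly bucket, and the helper name `hourly_means` are assumptions made for the example, and the sample data is made up.

```python
from collections import defaultdict

def hourly_means(records):
    """Aggregate (series, epoch_seconds, value) records into hourly means.

    Hypothetical helper standing in for the extract/prep/aggregate stages.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (series, hour_start) -> [sum, count]
    for series, ts, value in records:
        if value is None:            # prep: drop missing readings
            continue
        hour_start = ts - ts % 3600  # bucket by the start of the hour
        acc = sums[(series, hour_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Made-up sample data for illustration only.
records = [
    ("sensor-a", 7200, 1.0),
    ("sensor-a", 7260, 3.0),
    ("sensor-a", 10800, 5.0),
    ("sensor-b", 7205, None),
]
report = hourly_means(records)
```

Because the whole input is scanned before any result appears, a job like this only reports on data that has already landed in the store.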

Typical Time Series Analytics Flow: Challenges. Stages: data sources, data store, extract, prep, aggregate, reports (batch processing). Larger volumes of data from newer sources. Expensive to store. Takes a lot of time to process information and do simple aggregations. Not real-time.

Challenges with Status Quo. Existing systems aren't well suited to storing high volumes of time-series data. Building aggregations and statistical computations from time-series data is currently done in batch. Need: a cost-effective and reliable way to store and analyze large amounts of time-series data from various sources, and the ability to deploy real-time dashboarding and monitoring capabilities on aggregated data.

Time Series Solution Example: MapR Data Platform

MapR Data Exploration Advantages for IoT: self-service data exploration; a single SQL interface for structured and semi-structured data; data agility with less IT required.

Squaring the Circle: Enter Apache Drill. Drill is SQL compliant: it uses standard syntax and semantics. Drill extends SQL: first-class treatment of objects and lists, and full support for destructuring and flattening. The full power of the relational model can be applied to complex data.
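To make "flattening" concrete: Drill's `FLATTEN` function turns the elements of a repeated field into individual records. The plain-Python sketch below illustrates that idea only; it is not how Drill implements it, and the record shape and the helper name `flatten` are assumptions for the example.

```python
def flatten(records, list_field):
    """Yield one output record per element of each record's list_field.

    Plain-Python illustration of flattening a repeated field.
    """
    for record in records:
        for element in record.get(list_field, []):
            row = {k: v for k, v in record.items() if k != list_field}
            row[list_field] = element
            yield row

# Made-up nested record: one device with a list of readings.
nested = [
    {"device": "d1", "readings": [{"t": 0, "v": 1.16}, {"t": 4, "v": 0.04}]},
]
rows = list(flatten(nested, "readings"))
```

Each resulting row is flat enough for ordinary relational operations such as filtering and grouping.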

Drill Provides Scalable and Extended SQL

Introduction to OpenTSDB: HBase or MapR-DB

Speeding up OpenTSDB. Why can't it be faster? 20,000 data points per second per node in the cluster.

Speeding up OpenTSDB: open source MapR extensions. Available on GitHub: https://github.com/mapr-demos/opentsdb

Speeding up OpenTSDB: open source MapR extensions. Diagram components: samples, message queue, collector, MapR table, log, web service, users. The web service queries the database and the collector. Buffering data for 1 hour in the collector allows a >1000x decrease in insertion rate. Logging the latest hour of data allows a clean restart of the collector (lambda + epsilon architecture).
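The effect of buffering in the collector can be sketched as follows. This is a simplified illustration, not the code from the mapr-demos repository: the class name `BufferingCollector`, the in-memory `table` dictionary standing in for the database, and the one-sample-per-second input are all assumptions, and the restart log described above is left out.

```python
from collections import defaultdict

WINDOW = 3600  # seconds buffered per row, matching the one-hour window

class BufferingCollector:
    """Hypothetical collector: buffers samples, then writes one row per window."""

    def __init__(self):
        self.buffer = defaultdict(list)  # (series, window_start) -> [(offset, value)]
        self.table = {}                  # stand-in for the database table
        self.inserts = 0                 # number of table writes performed

    def add(self, series, ts, value):
        window_start = ts - ts % WINDOW
        self.buffer[(series, window_start)].append((ts - window_start, value))

    def flush_closed(self, now):
        """Write every window that ended at or before `now` as a single row."""
        closed = [key for key in self.buffer if key[1] + WINDOW <= now]
        for key in closed:
            self.table[key] = self.buffer.pop(key)
            self.inserts += 1

collector = BufferingCollector()
for second in range(3600):               # one sample per second for an hour
    collector.add("cpu.load", second, 0.5)
collector.flush_closed(now=3600)
```

In this run 3,600 samples reach the table in a single write instead of 3,600 separate writes, which is the kind of reduction in insertion rate the slide refers to.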

Direct Blob Loading for Testing. Diagram components: servers with collectors; TSDs; Grafana, the OpenTSDB UI, and scripts; blob loaders; HBase or MapR-DB.

A first example: Time-series data

Column names as data. When column names are not pre-defined, they can convey information. Examples: time offsets within a window for time series; top-level domains for web crawlers; vendor IDs for customer purchase profiles. A predefined schema is impossible for this idiom.
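For the time-series case, the idiom can be sketched with nested dictionaries standing in for a wide-column table. This is a minimal illustration under stated assumptions: the row key `(series_id, window_start)`, the one-hour window, the helper name `put_sample`, and the sample values are all made up for the example.

```python
# Rows are keyed by (series id, window start); column names are time offsets.
table = {}

def put_sample(table, series_id, ts, value, window=3600):
    """Store a sample under a column named after its offset within the window."""
    window_start = ts - ts % window
    row = table.setdefault((series_id, window_start), {})
    row[ts - window_start] = value  # the column name itself carries the time

put_sample(table, 101, 3600 + 3, 1.16)
put_sample(table, 101, 3600 + 67, 0.04)
put_sample(table, 101, 3600 + 71, 0.08)
```

No column is declared ahead of time: each new offset simply becomes a new column in the row for that window.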

Relational Model for Time-series

Time series ID   Sample time   Value
101              15:51:03      1.16
101              15:52:07      0.04
101              15:52:11      0.08

The slide annotates this table with the labels "Row key" and "Data values".

Table Design: Point-by-Point

Table Design: Hybrid Point-by-Point + Sub-table. After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
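One way to picture the restatement step is packing a closed window's per-point columns into a single binary value. The sketch below is an assumption-laden illustration, not the actual storage format: the layout (a count, then all offsets, then all values), the helper names `restate_row` and `read_blob`, and the column-family names `points` and `blob` are invented for the example.

```python
import struct

def restate_row(row):
    """Pack a closed window's {offset: value} columns into one binary value.

    Assumed layout: uint32 count, then all offsets as uint32, then all
    values as float64 (column-oriented: offsets together, values together).
    """
    offsets = sorted(row)
    values = [row[o] for o in offsets]
    n = len(offsets)
    return struct.pack(f"<I{n}I{n}d", n, *offsets, *values)

def read_blob(blob):
    """Inverse of restate_row: recover the {offset: value} mapping."""
    (n,) = struct.unpack_from("<I", blob, 0)
    offsets = struct.unpack_from(f"<{n}I", blob, 4)
    values = struct.unpack_from(f"<{n}d", blob, 4 + 4 * n)
    return dict(zip(offsets, values))

point_by_point = {3: 1.16, 67: 0.04, 71: 0.08}  # made-up window contents
families = {"points": point_by_point, "blob": None}
families["blob"] = restate_row(families["points"])  # run once the window closes
```

A reader of a closed window then fetches one value rather than one cell per sample.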

Compression Results. Each sample is a 64-bit time plus a 16-bit sample value, sampled at 10 kHz. Sample-time jitter makes it important to keep the original timestamp. How much overhead does retaining the timestamp add?
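A rough way to explore this question is to encode jittered timestamps as deviations from the nominal 10 kHz grid and measure the bytes per sample after compression. Everything specific here is an assumption for illustration: the uniform jitter model, the int16 residual encoding, and the use of zlib are not the method behind the slide's results, and the timestamps are synthetic.

```python
import random
import struct
import zlib

RATE_HZ = 10_000
PERIOD_NS = 1_000_000_000 // RATE_HZ  # nominal 100 microseconds between samples

def jittered_times(n, jitter_ns=2_000, seed=0):
    """Synthetic timestamps: nominal 10 kHz grid plus random jitter (made up)."""
    rng = random.Random(seed)
    return [i * PERIOD_NS + rng.randint(-jitter_ns, jitter_ns) for i in range(n)]

def encode_times(times):
    """Store each timestamp's deviation from the grid as int16, then deflate."""
    residuals = [t - i * PERIOD_NS for i, t in enumerate(times)]
    return zlib.compress(struct.pack(f"<{len(residuals)}h", *residuals))

def decode_times(blob, n):
    """Recover the exact original timestamps from the compressed residuals."""
    residuals = struct.unpack(f"<{n}h", zlib.decompress(blob))
    return [i * PERIOD_NS + r for i, r in enumerate(residuals)]

times = jittered_times(10_000)  # one second of samples
blob = encode_times(times)
raw_bytes_per_sample = 8 + 2    # 64-bit time plus 16-bit sample value
time_bytes_per_sample = len(blob) / len(times)
```

Comparing `time_bytes_per_sample` with the 8 raw timestamp bytes gives a feel for the cost of keeping original timestamps; the real figure depends on the actual jitter distribution.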

Insertion Speeds. Inserting pre-bundled data allows humongous data rates: >100 million samples/s on 4 nodes; >200 million samples/s on 8 nodes. Linear scaling if there are enough data sources.

How fast is OpenTSDB? OpenTSDB can scale to writing millions of data points per second on commodity servers with regular spinning hard drives. What if we needed 100-1000x faster writing speed?

Problem: How Do You Load Test Data at Large Scale? Testing at large scale is a more realistic measure of performance than testing on a small sample. But with high-velocity data, how do you set up the test? For a sample equivalent to long-term data, you need ingest rates 100 to 1000x faster than normal production, or you must wait years to load the test data. What's the solution?

The need for rapid data loading. Let's say you have 1M samples / second. Suppose you want to test your system, perhaps with a year of data, and you want to load that data in << 1 year. 100x real-time = 100M samples / second.
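The arithmetic behind these figures can be worked through directly. The variable names are illustrative, and a 365-day year is assumed:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

production_rate = 1_000_000  # samples per second, from the scenario above
speedup = 100                # load at 100x real time

samples_in_year = production_rate * SECONDS_PER_YEAR
load_rate = production_rate * speedup             # required ingest rate
load_days = samples_in_year / load_rate / 86_400  # wall-clock days to load
```

A year of data at this rate is about 3.15 x 10^13 samples; loading it at 100M samples / second takes 3.65 days instead of a year.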

Short Books by Ted Dunning & Ellen Friedman. Published by O'Reilly in 2014 and 2015. For sale from Amazon or O'Reilly. Free e-books currently available courtesy of MapR: http://bit.ly/ recommendationebook http://bit.ly/mapr-tsdbebook http://bit.ly/ebookanomaly http://bit.ly/ebook-realworld-hadoop

Thank You. Contact: jbates@mapr.com. Social: @mapr, maprtech, mapr-technologies, MapRTechnologies, maprtech.