A framework for easy development of Big Data applications

A framework for easy development of Big Data applications
Rubén Casado (ruben.casado@treelogic.com, @ruben_casado)

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

About me :-)

Academics: PhD in Software Engineering; MSc in Computer Science; BSc in Computer Science. Work experience.

About Treelogic

Treelogic is an R&D-intensive company whose mission is to create, boost, develop and adapt scientific and technological knowledge to improve quality standards in our daily lives.

Treelogic at a glance:
- R&D: International, National, Regional and Internal Projects; R&D Management System
- Research lines: Computer Vision, Big Data, Terahertz technology, Data science, Social Media Analysis, Semantics
- Solutions: Security & Safety, Justice, Health, Transport, Financial services, ICT tailored solutions
- Distribution and Sales

7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them. 3 ongoing Eurostars projects, coordinating all of them.

Research & Innovation: 7 years of experience in R&D projects; more than 300 partners in the last 3 years; more than 40 projects with a combined budget over 120 MEUR; project coordinator in 7 European projects; overall participation in 11 European projects.

www.datadopter.com

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

What is Big Data? A massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques.

What is Big Data like? "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (Gartner IT Glossary)

3 problems: Volume, Variety, Velocity

3 solutions: Batch processing, NoSQL, Real-time processing

Batch processing (Volume): scalable, distributed, parallel, fault-tolerant; processes large amounts of static data; high latency.

Real-time processing (Velocity): low latency; continuous, unbounded streams of data; distributed, parallel, fault-tolerant.

Hybrid computation model (Volume + Velocity): low latency; massive data + streaming data; scalable; combines batch and real-time results.

Hybrid computation model (diagram): all data -> batch processing -> batch results; new data -> real-time processing -> stream results; batch results + stream results -> combination -> final results.
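The combination step is what makes the hybrid model work: a query merges the complete-but-stale batch results with the fresh-but-partial stream results. A minimal, framework-independent Java sketch of that merge, assuming a simple count metric and hypothetical in-memory views (nothing here is Lambdoop API):

import java.util.HashMap;
import java.util.Map;

/** Minimal Lambda Architecture merge sketch: batch view + real-time view. */
public class LambdaMergeSketch {

    // Batch view: counts precomputed over all historical data (stale but complete).
    static Map<String, Long> batchView = new HashMap<>();

    // Real-time view: counts over data that arrived after the last batch run.
    static Map<String, Long> realtimeView = new HashMap<>();

    /** A query merges both views to get a complete, up-to-date result. */
    static long query(String key) {
        long batch = batchView.getOrDefault(key, 0L);
        long stream = realtimeView.getOrDefault(key, 0L);
        return batch + stream; // combination step of the hybrid model
    }

    public static void main(String[] args) {
        batchView.put("SO2.readings", 1_000_000L); // from the batch layer
        realtimeView.put("SO2.readings", 42L);     // from the speed layer
        System.out.println(query("SO2.readings")); // prints 1000042
    }
}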

Processing paradigms timeline:
- Inception (2003)
- 1st generation (2006): batch processing; large amounts of static data; scalable solutions (Volume)
- 2nd generation (2010): real-time processing; computing streaming data; low latency (Velocity)
- 3rd generation (2014): hybrid computation; Lambda Architecture (Volume + Velocity)

Processing pipeline: data acquisition -> data storage -> data analysis -> results.

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

What is Lambdoop? An open source framework:
- Software abstraction layer over Open Source technologies (Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis)
- Common patterns and operations (aggregation, filtering, statistics...) already implemented; no MapReduce-like process
- Same single API for the three processing paradigms:
  - Batch processing similar to Pig / Cascading
  - Real-time processing using built-in functions, easier than Trident
  - Hybrid computation model transparent for the developer

Why Lambdoop? Building a batch processing application requires:
- MapReduce development
- Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog...)
- Storage systems (HBase, MongoDB, HDFS, Cassandra...)

Real-time processing requires:
- Streaming computing (S4, Storm, Samza)
- Unbounded input (Flume, Scribe)
- Temporal data stores (in-memory, Kafka, Kestrel)

Why Lambdoop? Building a hybrid computation system (Lambda Architecture) requires:
- Defining the application logic in two different systems using different frameworks
- Serializing data consistently and keeping it in sync between the two systems
- Making the developer responsible for reading, writing and managing two data storage systems, performing the final combination and serving the updated results

Why Lambdoop? "One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture." (Nathan Marz)

"Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (...) are there, but there is a shortage of people with the expertise to leverage them." (Rajat Jain)

Lambdoop building blocks (diagram): static data and streaming data enter as Data objects; Operations process Data; Workflows connect Operations and Data.

Lambdoop processing modes (diagram): Batch, Hybrid, Real-Time.

Data input: information is represented as Data objects.
- Types: StaticData and StreamingData
- Every Data object has a Schema describing the data fields (types, nullables, keys...)
- A Data object is composed of Datasets

Data input: Dataset.
- A Data object is formed by one or more Datasets
- All Datasets of a Data object share the same Schema
- Datasets are formed by Register objects
- A Register is composed of RegisterFields
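To make the nesting concrete, here is a hypothetical traversal of that hierarchy; the accessor names (getDatasets, getRegisters, getValue) are assumptions for illustration, not Lambdoop's documented API:

// Hypothetical traversal of a Data object; accessor names are assumed.
for (Dataset dataset : input.getDatasets()) {            // Data -> Datasets (same Schema)
    for (Register register : dataset.getRegisters()) {   // Dataset -> Registers
        Object so2 = register.getValue(new RegisterField("SO2")); // Register -> field value
        System.out.println(so2);
    }
}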

Data input: Schema.
- Very similar to Avro definition schemas
- Allows defining the input data's structure: fields, types, nullables...
- JSON format

Sample air-quality records (columns: Station, Title, Lat., Lon., Date, SO2, NO, CO, PM10, O3, dd, vv, TMP, HR, PRB) and the corresponding schema:

{
  "type": "csv",
  "name": "AirQuality records",
  "fieldseparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}

Importing data into Lambdoop:
- Loaders: import information from multiple sources and store it in HDFS as Data objects
- Producers: get streaming data and represent it as Data objects
- Heterogeneous sources
- Information is serialized into Avro format

Data input: static data example. Importing an air quality dataset from local logs to HDFS with a Loader (the schema's path is files/csv/air_quality_schema):

// Read schema from a file
String schema = readSchemaFile(schemaFile);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);

Data input: streaming data example. Reading streaming sensor data from a TCP port with a Producer (weather stations emit messages to port 8080; the schema's path is files/csv/air_quality_schema):

int port = 8080;
// Read schema
String schema = readSchemaFile(schemaFile);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);

Data input: extensibility. Users can implement their own data Loaders/Producers, as sketched below:
1) Extend the Loader/Producer interface
2) Read data from the original source
3) Get and serialize the information (Avro format) according to the Schema
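As an illustration of those three steps, a hedged sketch of a custom Producer that reads a local file and emits each line as a record. The Producer base class, its constructor and the emit() hook are assumptions; only the three-step shape (extend, read, serialize) comes from the slide:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical custom Producer; base class, constructor and emit() are assumed.
public class FileLineProducer extends Producer {

    private final String path;

    public FileLineProducer(String name, String schema, String path) {
        super(name, schema);          // 1) extend the Producer interface
        this.path = path;
    }

    @Override
    public void run() throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {   // 2) read from the original source
                // 3) serialize to Avro according to the Schema (helper assumed)
                emit(serializeToAvro(line));
            }
        }
    }
}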

Operations: unitary actions to process data. An Operation takes Data as input, processes it, and produces another Data as output. Types of operations:
- Aggregation: produces a single value per Dataset
- Filter: output data has the same schema as the input data
- Group: produces several Datasets, grouping registers together
- Projection: changes the Data schema but preserves the records and their values
- Join: combines different Data objects

Built-in operations by type:
- Aggregation (1): Count, Average, Sum, MinValue, MaxValue
- Aggregation (2): Skewness, Z-Test, Stderror, Variance, Covariance, Mode
- Filter: Filter, Limit, TopN, BottomN, Max, Min
- Group: Group, RollUp, Cube, N-Til
- Projection: Select, Frequency, Variation
- Join: Inner Join, Left Join, Right Join, Outer Join

Operations: extensibility (User-Defined Operations). New operations can be defined by implementing a set of interfaces (see the sketch after the diagram below):
- OperationFactory: factory used by the framework to get the batch, streaming and hybrid operation implementations when needed
- BatchOperation: provides the MapReduce logic to process the input Data
- StreamingOperation: provides Storm/Trident-based functions to process streaming registers
- HybridOperation: provides the merging logic between streaming and batch results

Operations: User-Defined Operation interfaces (diagram).
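As a concrete shape for those interfaces, a hedged sketch of a user-defined Median operation; the method names are assumptions inferred from the interface descriptions above, not Lambdoop's published signatures:

// Hypothetical user-defined operation wiring; method names are assumed.
public class MedianFactory implements OperationFactory {
    public BatchOperation     getBatchOperation()     { return new MedianBatch(); }
    public StreamingOperation getStreamingOperation() { return new MedianStreaming(); }
    public HybridOperation    getHybridOperation()    { return new MedianHybrid(); }
}

class MedianBatch implements BatchOperation {
    // MapReduce logic: map emits (field, value); reduce sorts values and picks the middle one.
}

class MedianStreaming implements StreamingOperation {
    // Storm/Trident function: maintain a running estimate of the median over the stream.
}

class MedianHybrid implements HybridOperation {
    // Merging logic: reconcile the exact batch median with the streaming estimate.
}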

Workflows: sequences of connected Operations. A Workflow manages tasks and resources (check-points) in order to produce an output from input data and a set of Operations:
- BatchWorkflow: runs a set of operations on a StaticData input and produces a new StaticData as output
- StreamingWorkflow: operates on a StreamingData to produce another StreamingData
- HybridWorkflow: combines static and streaming data to produce complete, continuously updated results (StreamingData)

Workflow connections (diagram): workflows can be chained, with the Data output of one workflow feeding one or more downstream workflows; a chaining sketch follows.
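The examples on the next slides show single workflows; chaining, per the diagram, might look like this hedged sketch. It reuses only API calls shown on the slides; whether getResults() output can be fed directly into a second workflow's constructor is an assumption:

// Hypothetical workflow chaining: the output Data of one workflow feeds the next.
Workflow filterWf = new BatchWorkflow(input);
filterWf.addOperation(new Filter(new RegisterField("Station"), ConditionType.EQUAL, new StaticValue("street 45")));
filterWf.run();
Data filtered = filterWf.getResults();

Workflow statsWf = new BatchWorkflow(filtered);  // downstream workflow consumes upstream output
statsWf.addOperation(new Avg(new RegisterField("SO2")));
statsWf.run();
Data output = statsWf.getResults();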

Workflows: batch processing example.

// Batch processing example
String schema = readSchemaFile(schemaFile);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));

// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the results
Data output = wf.getResults();

Workflows: real-time processing example.

// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));

// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the (continuously updated) results
while (!stop) {
    Data output = wf.getResults();
}

Workflows: hybrid computation example.

// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);

// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);

// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);

// Run the workflow
wf.run();

// Get the results
while (!stop) {
    Data output = wf.getResults();
}

Results exploitation (diagram): workflow results (Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join...) can feed visualization dashboards or be exported (CSV, JSON, ...).

Results exploitation: visualization.

/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer( );
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);

/* Add operations to the workflow */
wf.addOperation(new Count());

/* Get results from the workflow */
Data results = wf.getResults();

/* Show results; set the dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "TweetsCount"));


Results exploitation: export (CSV, JSON, ...).

Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);

/* Add operations to the workflow */
wf.addOperation(new Count());

/* Get results from the workflow */
Data results = wf.getResults();

/* Export results */
Exporter.asCSV(results, file);
// Other exporters shown on the slide (signature shorthand):
// MongoExport(results, Map<String, String> conf);
// PostgresExport(results, Map<String, String> conf);

Results exploitation: alarms.

Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);

/* Add operations to the workflow */
wf.addOperation(new Count());

/* Get results from the workflow */
Data results = wf.getResults();

/* Set alarm.
   condition: true/false (e.g. time or a certain value)
   action: execution (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);
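A concrete (hypothetical) instantiation of that condition/action pair, assuming functional-style Condition and Action types, a getValue() accessor and a Mailer helper; none of these names are confirmed by the slides:

/* Hypothetical condition: fire when the tweet count exceeds a threshold */
Condition condition = d -> (long) d.getValue(new RegisterField("count")) > 10_000L;

/* Hypothetical action: notify operators */
Action action = d -> Mailer.send("ops@example.com", "Tweet count spike detected");

AlarmFactory.setAlert(results, condition, action);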

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

Cluster administration and monitoring:
- Change configurations and easily manage the cluster
- Friendly tools for monitoring the health of the cluster
- Wizard-driven Lambdoop installation of new nodes

Visual editor for defining workflows and scheduling tasks:
- Plugin for Eclipse
- Visual elements for: input sources, Loaders, Operations, Operation parameters (RegisterFields, static values), visualization elements
- Generates workflow code
- XML import/export
- Scheduling of workflows

Tool for working with messy big data, cleaning it and transforming it:
- Import data in different formats
- Explore datasets
- Apply advanced cell transformations
- Refine inconsistencies
- Filter and partition your big data

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

Social Awareness Based Emergency Situation Solver.
- Objective: to create event-assessment and decision-making support tools that improve speed and efficiency of decision making when facing emergency situations
- Exploits the information available in social networks to complement data about emergency situations
- Real-time processing

Alert detection: locations, information, attached resources (photos, videos, links, ...).

Static stations and mobile sensors in Asturias sending streaming data; historical data covering more than 10 years; monitoring, trend identification, predictions. Batch processing + real-time processing + hybrid computation.

Quantum mechanics / molecular dynamics: computer simulation of the physical movements of microscopic elements.
- Large amounts of data arriving as streams at each time-step
- Real-time interaction (queries, visual exploration) during the simulation
- Data analytics on the whole dataset
- Real-time processing + batch processing + hybrid computation

Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions

Conclusions:
- Big Data is not only batch processing
- Implementing a Lambda Architecture is not trivial
- Lambdoop: Big Data made easy
  - High abstraction layer for all processing models
  - Covers all steps in the data processing pipeline
  - Same Java API for all programming paradigms
  - Extensible

Roadmap.
Now:
- Release an early version of the Lambdoop framework as open source
- Get feedback from the community
Next:
- Increase the set of built-in functions
- Move all components to YARN
- Stable versions of the Lambdoop ecosystem
Beyond:
- Models (Mahout, Jubatus, Samoa, R)
- Configurable processing engines (Spark, S4, Samza...)
- Configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB...)

If you want to stay tuned about Lambdoop, register at www.lambdoop.com

ruben.casado@treelogic.com | info@datadopter.com
www.lambdoop.com | www.datadopter.com | www.treelogic.com
@ruben_casado | @datadopter | @treelogic