A framework for easy development of Big Data applications Rubén Casado ruben.casado@treelogic.com @ruben_casado
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
About me :-)
Academics: PhD in Software Engineering, MSc in Computer Science, BSc in Computer Science. Work Experience.
About Treelogic
Treelogic is an R&D intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life
TREELOGIC Distributor and Sales
R&D: International Projects, National Projects, Regional Projects, Internal Projects, R&D Manag. System. Research Lines: Computer Vision, Big Data, Terahertz technology, Data science, Social Media Analysis, Semantics. Solutions: Security & Safety, Justice, Health, Transport, Financial services, ICT tailored solutions.
7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them. 3 ongoing Eurostars projects, coordinating all of them.
7 years of experience in R&D projects. More than 300 partners in the last 3 years. More than 40 projects with a budget over 120 MEUR. Project coordinator in 7 European projects. Overall participation in 11 European projects. Research & INNOVATION.
www.datadopter.com
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
What is Big Data? A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques.
How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -
3 problems: Volume, Variety, Velocity
3 solutions: Batch processing, NoSQL, Real-time processing
Batch processing — Volume: scalable, large amounts of static data, distributed, parallel, fault tolerant, high latency.
Real-time processing — Velocity: low latency, continuous unbounded streams of data, distributed, parallel, fault-tolerant.
Hybrid computation model — Volume + Velocity: low latency, massive data + streaming data, scalable, combines batch and real-time results.
Hybrid computation model: all data → batch processing → batch results; new data → real-time processing → stream results; combination of both → final results (see the sketch below).
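A minimal, self-contained Java sketch of the combination step in this hybrid model: a precomputed batch view is merged with a real-time view that only covers data arrived since the last batch run. All names and figures are illustrative, not part of Lambdoop.

import java.util.HashMap;
import java.util.Map;

public class HybridCombineSketch {

    public static void main(String[] args) {
        // Batch results: e.g. record counts per station computed over all historical data
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("station-23", 1_500_000L);
        batchView.put("station-32", 980_000L);

        // Stream results: counts for records received after the last batch run
        Map<String, Long> realTimeView = new HashMap<>();
        realTimeView.put("station-23", 420L);
        realTimeView.put("station-77", 35L);   // station not yet seen by the batch layer

        // Combination: final result = batch view + real-time increments
        Map<String, Long> finalResults = new HashMap<>(batchView);
        realTimeView.forEach((key, value) -> finalResults.merge(key, value, Long::sum));

        finalResults.forEach((station, count) ->
                System.out.println(station + " -> " + count));
    }
}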
Processing Paradigms (timeline from inception in 2003): 1st Generation (2006) — Batch processing: large amounts of static data, scalable solutions (Volume). 2nd Generation (2010) — Real-time processing: computing streaming data, low latency (Velocity). 3rd Generation (2014) — Hybrid computation: Lambda Architecture (Volume + Velocity).
Processing Pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
What is Lambdoop? An open source framework: a software abstraction layer over Open Source technologies (Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis).
- Common patterns and operations (aggregation, filtering, statistics, ...) already implemented. No MapReduce-like programming.
- The same single API for the three processing paradigms:
- Batch processing similar to Pig / Cascading
- Real-time processing using built-in functions, easier than Trident
- Hybrid computation model transparent to the developer
Why Lambdoop? Building a batch processing application requires:
- MapReduce development
- Using other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog, ...)
- Storage systems (HBase, MongoDB, HDFS, Cassandra, ...)
Real-time processing requires:
- Streaming computing (S4, Storm, Samza)
- Unbounded input (Flume, Scribe)
- Temporal data stores (in-memory, Kafka, Kestrel)
Why Lambdoop? Building a hybrid computation system (Lambda Architecture) requires:
- Application logic defined in two different systems using different frameworks
- Data serialized consistently and kept in sync between the two systems
- The developer being responsible for reading, writing and managing two data storage systems, performing the final combination and serving the updated results
Why Lambdoop? "One of the most interesting areas of future work is high level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture." — Nathan Marz. "Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop ( ) are there, but there is a shortage of people with the expertise to leverage them." — Rajat Jain
Lambdoop: static data and streaming data are represented as Data objects, processed by Operations inside a Workflow that produces new Data.
Lambdoop Batch Hybrid Real-Time
Data Input — Data: information is represented as Data objects.
- Types: StaticData and StreamingData
- Every Data object has a Schema describing its fields (types, nullable fields, keys, ...)
- A Data object is composed of Datasets
Data Input — Dataset:
- A Data object is formed by one or more Datasets
- All Datasets of a Data object share the same Schema
- Datasets are formed by Register objects
- A Register is composed of RegisterFields
(A traversal sketch of this hierarchy follows below.)
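A minimal traversal sketch of the Data → Dataset → Register → RegisterField hierarchy. The accessor names (getDatasets, getRegisters, getValue) are assumptions for illustration only; the class names Data, StaticData, Dataset, Register and RegisterField come from the slides.

// Hypothetical traversal of a StaticData object; accessor names are assumed
Data input = new StaticData(loader);
for (Dataset dataset : input.getDatasets()) {            // all Datasets share the Data object's Schema
    for (Register register : dataset.getRegisters()) {   // one Register per record
        Object so2 = register.getValue(new RegisterField("SO2")); // field access by name
        System.out.println("SO2 = " + so2);
    }
}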
Data Input — Schema:
- Very similar to Avro schema definitions
- Allows defining the input data's structure: fields, types, nullable fields, ...
- JSON format
Example: air quality records with fields Station, Title, Lat., Lon., Date, SO2, NO, CO, PM10, O3, dd, vv, TMP, HR, PRB. Sample schema:
{
  "type": "csv",
  "name": "AirQuality records",
  "fieldseparator": ";",
  "PK": "",
  "header": "true",
  "fields": [
    {"name": "Station", "type": "string", "index": 0},
    {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
    {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
    {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
    {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
  ]
}
Data Input — Importing data into Lambdoop:
- Loaders: import information from multiple sources and store it in HDFS as Data objects
- Producers: get streaming data and represent it as Data objects
- Heterogeneous sources
- Information is serialized into Avro format
Data Input — StaticData example: importing an air quality dataset from local logs into HDFS using a Loader. The schema's path is files/csv/air_quality_schema.
// Read schema from a file
String schema = readSchemaFile(schemaFile);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Data Input — StreamingData example: reading streaming sensor data from a TCP port using a Producer. Weather stations emit messages to port 8080; the schema's path is files/csv/air_quality_schema.
int port = 8080;
// Read schema
String schema = readSchemaFile(schemaFile);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create Data object
Data data = new StreamingData(producer);
Data Input — Extensibility: users can implement their own data loaders/producers (a hypothetical example follows below):
1) Extend the Loader/Producer interface
2) Read data from the original source
3) Get and serialize the information (Avro format) according to the Schema
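A hypothetical custom loader following the three steps above. The Loader interface and the no-argument load() method are assumptions for illustration; only the step order comes from the slides.

// Hypothetical JDBC-backed loader; interface method names are assumed
public class JdbcLoader implements Loader {

    private final String jdbcUrl;
    private final String schema;   // Lambdoop JSON schema describing the fields

    public JdbcLoader(String jdbcUrl, String schema) {
        this.jdbcUrl = jdbcUrl;
        this.schema = schema;
    }

    @Override
    public void load() {
        // 1) Read data from the original source (here, a relational database at jdbcUrl)
        // 2) Map each row to the fields declared in the schema
        // 3) Serialize the records into Avro and write them to HDFS,
        //    so they can be wrapped in a StaticData object
    }
}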
Operations: unitary actions to process data. An Operation takes Data as input, processes it and produces another Data as output. Types of operations:
- Aggregation: produces a single value per Dataset
- Filter: output data has the same schema as the input data
- Group: produces several Datasets, grouping registers together
- Projection: changes the Data schema, but preserves the records and their values
- Join: combines different Data objects
Built-in operations:
- Aggregation (1): Count, Average, Sum, MinValue, MaxValue, Max, Min
- Aggregation (2): Skewness, Z-Test, StdError, Variance, Covariance, Mode
- Filter: Filter, Limit, TopN, BottomN
- Group: Group, RollUp, Cube, N-Til
- Projection: Select, Frequency, Variation
- Join: Inner Join, Left Join, Right Join, Outer Join
Operations — Extensibility (User Defined Operations): new operations can be defined by implementing a set of interfaces:
- OperationFactory: factory used by the framework to get the batch, streaming and hybrid operation implementations when needed
- BatchOperation: provides the MapReduce logic to process the input Data
- StreamingOperation: provides Storm/Trident based functions to process streaming registers
- HybridOperation: provides the merging logic between streaming and batch results
Operations — User Defined Operation interfaces (a hypothetical skeleton follows below).
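A hypothetical skeleton of a user-defined "Median" operation wired through the four interfaces listed above. The factory method names are assumptions; the slides only state each interface's responsibility (factory, MapReduce logic, Storm/Trident logic, merge logic).

// Hypothetical user-defined operation; method names on the interfaces are assumed
public class MedianOperationFactory implements OperationFactory {

    @Override
    public BatchOperation getBatchOperation() {
        // MapReduce logic that computes the exact median over a StaticData input
        return new MedianBatchOperation();
    }

    @Override
    public StreamingOperation getStreamingOperation() {
        // Storm/Trident based function that estimates the median over streaming registers
        return new MedianStreamingOperation();
    }

    @Override
    public HybridOperation getHybridOperation() {
        // Merging logic that combines the batch median with the streaming estimate
        return new MedianHybridOperation();
    }
}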
Workflows: a sequence of connected Operations. A Workflow manages tasks and resources (check-points) in order to produce an output from the input data and a set of Operations:
- BatchWorkflow: runs a set of operations on a StaticData input and produces a new StaticData as output
- StreamingWorkflow: operates on a StreamingData to produce another StreamingData
- HybridWorkflow: combines static and streaming data to produce complete and updated results (StreamingData)
Workflow connections: the output Data of one Workflow can feed another Workflow (a chaining sketch follows below).
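A minimal sketch of a workflow connection, assuming the output Data of one BatchWorkflow can be passed directly as the input of a second one; the operation constructors follow the signatures shown in the examples on the next slides.

// Sketch: chaining two batch workflows through an intermediate Data object
Data input = new StaticData(loader);

Workflow first = new BatchWorkflow(input);
first.addOperation(new Filter(new RegisterField("Title"),
        ConditionType.EQUAL, new StaticValue("street 45")));
first.run();
Data filtered = first.getResults();

// The intermediate Data object feeds a second workflow
Workflow second = new BatchWorkflow(filtered);
second.addOperation(new Avg(new RegisterField("SO2")));
second.run();
Data output = second.getResults();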
Workflows — batch processing example:
// Batch processing example
String schema = readSchemaFile(schemaFile);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();
Workflows — real-time processing example:
// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}
Workflows — hybrid computation example:
// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}
Results exploitation: workflow results (Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join, ...) can be sent to VISUALIZATION or EXPORT (CSV, JSON, ...).
Results exploitation — Visualization:
/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer( );
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Show results. Set the dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));
Results exploitation — Visualization
Results exploitation — Export (CSV, JSON, ...):
Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Export results */
Exporter.asCSV(results, file);
MongoExport(results, conf);
PostgresExport(results, conf);
Results exploitation — Alarms:
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Set an alarm.
   condition: true/false check (e.g. a time or a certain value)
   action: what to execute (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
Change configurations and easily manage the cluster. Friendly tools for monitoring the health of the cluster. Wizard-driven Lambdoop installation on new nodes.
Visual editor for defining workflows and scheduling tasks:
- Plugin for Eclipse
- Visual elements for: input sources, Loaders, Operations, operation parameters (RegisterFields, static values), visualization elements
- Generates workflow code
- XML import/export
- Scheduling of workflows
Tool for working with messy big data, cleaning it and transforming it: import data in different formats, explore datasets, apply advanced cell transformations, refine inconsistencies, filter and partition your big data.
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
Social Awareness Based Emergency Situation Solver. Objective: to create event assessment and decision-making support tools which improve the speed and efficiency of decision making when facing emergency situations. Exploits the information available in social networks to complement data about emergency situations. Real-time processing.
Alert detection, locations, information, attached resources (photos, videos, links, ...)
Static stations and mobile sensors in Asturias sending streaming data. Historical data spanning more than 10 years. Monitoring, trend identification, predictions. Batch processing + real-time processing + hybrid computation.
Quantum Mechanics Molecular Dynamics: computer simulation of the physical movements of microscopic elements. Large amounts of data streamed at each time-step. Real-time interaction (queries, visual exploration) during the simulation. Data analytics on the whole dataset. Real-time processing + batch processing + hybrid computation.
Agenda 1. Big Data processing 2. Lambdoop framework 3. Lambdoop ecosystem 4. Case studies 5. Conclusions
Conclusions: Big Data is not only batch processing. Implementing a Lambda Architecture is not trivial. Lambdoop: Big Data made easy — a high abstraction layer for all processing models, covering all steps in the data processing pipeline, with the same Java API for all programming paradigms, and extensible.
Conclusions — Roadmap.
Now: release an early version of the Lambdoop Framework as open source; get feedback from the community.
Next: increase the set of built-in functions; move all components to YARN; stable versions of the Lambdoop ecosystem.
Beyond: models (Mahout, Jubatus, Samoa, R); configurable processing engines (Spark, S4, Samza, ...); configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB, ...).
If you want to stay tuned about Lambdoop, register at www.lambdoop.com. ruben.casado@treelogic.com info@datadopter.com www.lambdoop.com www.datadopter.com www.treelogic.com @ruben_casado @datadopter @treelogic