BIG DATA: Alignment of Supply & Demand
Nuria de Lama, Representative of Atos Research & Innovation to the EC
8th February, Luxembourg
04-08-2011
Atos Research and Innovation. Your business technologists. Powering progress
Table of Contents
1. Setting the scene: introduction to Big Data
2. Business potential
3. Open Source Big Data Management at Atos
4. New demanding environments for technology adoption: the Olympic Games
5. The opportunity of Future Internet
   - The Future Internet Public Private Partnership
   - The FI-WARE Project (Technology Foundation of the FI PPP)
6. Final remarks
A definition of Big Data
Big Data is a trend related to the explosion of available data due to the augmentation of potential data sources beyond traditional structured data.
[Figure: Evolution of data created and available storage capacity, 2006 to 2011. Series: Information created, Available storage. Source: IDC]
New sources of Big Data
- Unstructured data: multi-format, from multiple sources
- Real-time data: the value lies in the real-time aspects of the data, which often comes from sensors
- Exhaust data: technical logs that had not previously been considered
- Linked and shared data: data located in multiple data silos, which acquires meaning through relationships
- Social data: data coming from social networks and related systems
- Open data: data made publicly available by governments or public institutions
Big Data technical challenges
- Storing data
  - Traditional RDBMS are not suitable
  - Requires massively distributed storage (including NoSQL databases)
- Processing data
  - Requires a massively distributed computing framework: MapReduce, Hadoop
  - Real-time Big Data processing: still emergent
- Analysing data
  - Data mining / machine learning / data visualization
  - Requires new, specific implementations on top of distributed computing frameworks
- Global issues (clear needs in data usage & exploitation)
  - Tools are still immature
  - They require specific, rare skills and knowledge
Storing Big Data: the NoSQL movement
NoSQL = «Not Only SQL»: give up the relational model
4 kinds of databases:
- Key-value stores: distributed, persistent hashtables. Data is stored in small chunks. Voldemort, Membase.
- Column-oriented data stores: use bi-dimensional keys (rows and columns); data is stored by column. Optimized for accessing consecutive elements. Google BigTable, Cassandra, HBase, Hypertable.
- Document-oriented data stores: key-value stores whose values are semi-structured documents (typically JSON or XML). MongoDB, CouchDB.
- Graph-oriented databases: store and query graph-structured data. Neo4j.
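As a rough illustration of the four kinds of databases listed above, the following Python sketch mimics each data model with in-memory structures. It does not use the API of any product named on this slide; all keys, field names, values and helper functions are made up for illustration.

```python
# Illustrative only: plain Python structures standing in for the four NoSQL data models.

# 1. Key-value store: an opaque value looked up by a single key.
key_value = {"session:42": b"opaque-bytes"}

# 2. Column-oriented store: a value addressed by (row key, column name).
column_store = {
    "user:1": {"name": "Ana", "city": "Madrid"},
    "user:2": {"name": "Luc"},  # rows need not share the same columns
}

# 3. Document store: the value is a semi-structured document that can be queried by field.
documents = {
    "order:7": {"customer": "user:1", "items": [{"sku": "A1", "qty": 2}]},
    "order:8": {"customer": "user:2", "items": [{"sku": "B5", "qty": 1}]},
}

# 4. Graph database: nodes connected by typed relationships (edges).
edges = [("user:1", "FOLLOWS", "user:2"), ("user:2", "FOLLOWS", "user:1")]


def get_cell(row_key, column):
    """Column-oriented access: return one cell, or None if it is absent."""
    return column_store.get(row_key, {}).get(column)


def find_orders_by_customer(customer):
    """Document-style query on a field inside the stored documents."""
    return sorted(key for key, doc in documents.items() if doc["customer"] == customer)


def neighbours(node, relation):
    """Graph-style query: follow outgoing edges of one relationship type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]
```

The point of the sketch is the shape of the lookup in each model: a single key, a (row, column) pair, a field inside a document, or a relationship between nodes.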
A comparison of open source NoSQL storage solutions

                                Cassandra               HBase                   Voldemort   Membase     MongoDB            CouchDB
Type                            Column-oriented         Column-oriented         Key-value   Key-value   Document-oriented  Document-oriented
Maturity                        ***                     ***                     ***         *           ***                **
Support                         ****                    ****                    ***         *           ***                ***
Typical use                     Massive amount of data  Massive amount of data  Heavy load  Heavy load  Rich querying      Rich querying, heavy load
Typical cluster size (servers)  Up to hundreds          Up to hundreds          Up to tens  Up to tens  Up to tens         Some
Specific multisite management   Yes                     Yes                     No          No          Yes                No
MapReduce processing support    Yes                     Yes                     No          No          Yes                Yes
Performance                     ***                     ***                     *****       ***         ***                *
Batch computations on Big Data: the MapReduce approach
MapReduce is a framework to distribute processing tasks to a large cluster of computers.
Two steps:
- Map step: division into independent subtasks executed in parallel by Map processes
- Reduce step: results are recursively merged until the final result is computed
Map processes are executed where the data is located, which reduces data transfer
Algorithms have to be adapted to the MapReduce method
Hadoop: the main MapReduce framework
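The two steps described above can be sketched with the classic word-count example. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are hypothetical, and the chunks are processed sequentially here, whereas a real cluster would run the Map processes in parallel on the nodes that hold each chunk.

```python
from collections import defaultdict
from functools import reduce


def map_step(chunk):
    """Map: turn one chunk of text into independent (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(values):
    """Reduce: merge the partial results for one key into a single value."""
    return reduce(lambda a, b: a + b, values)


def word_count(chunks):
    """Run map over every chunk, group by word, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    return {word: reduce_step(counts) for word, counts in shuffle(mapped).items()}
```

Because the Reduce operation (addition) is associative, partial sums can be merged in any order, which is what allows the merge to be distributed.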
Towards real-time Big Data: Complex Event Processing
Complex Event Processing (CEP) consists of the processing, filtering, aggregation and combination of a very large number of events (>100,000/s) in real time
2 categories of Complex Event Processing:
- Computation-oriented CEP: processing of both historical and real-time data
- Detection-oriented CEP: filtering and rules matching on real-time data
3 concepts associated with Complex Event Processing:
- Events: the occurrence of an action or a threshold being reached
- Requests: used to trim, correlate and aggregate events
- Listener: called when a set of criteria is met, to send the proper notifications
Main products in the field: Yahoo! S4, Esper, Streambase
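The three concepts above (events, requests, listener) can be illustrated with a small detection-oriented sketch in plain Python. It does not reproduce the API of S4, Esper or Streambase; the class `ThresholdRule`, the event fields `source` and `value`, and the sample stream are all assumptions made for the example.

```python
from collections import deque


class ThresholdRule:
    """A 'request': filters events by source, averages the last `window`
    values, and calls the listener when the average exceeds `limit`."""

    def __init__(self, source, window, limit, listener):
        self.source = source
        self.limit = limit
        self.listener = listener
        self.values = deque(maxlen=window)  # sliding window of recent values

    def on_event(self, event):
        if event["source"] != self.source:  # trim: ignore unrelated events
            return
        self.values.append(event["value"])  # aggregate over the window
        if len(self.values) == self.values.maxlen:
            average = sum(self.values) / len(self.values)
            if average > self.limit:  # criteria met: notify the listener
                self.listener(self.source, average)


alerts = []  # the listener simply records notifications here
rule = ThresholdRule("sensor-1", window=3, limit=50.0,
                     listener=lambda source, avg: alerts.append((source, avg)))

stream = [
    {"source": "sensor-1", "value": 40.0},
    {"source": "sensor-2", "value": 99.0},  # filtered out by the rule
    {"source": "sensor-1", "value": 50.0},
    {"source": "sensor-1", "value": 70.0},
    {"source": "sensor-1", "value": 90.0},
]
for event in stream:
    rule.on_event(event)
```

With this sample stream the listener is called twice: once when the window first fills, and again when the next reading pushes the average higher.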
Big Data elements together: a data brokerage platform
[Figure: data brokerage platform (analytics & decision). Labels: Applications; Read / Write / Update / Delete APIs; Analytics (all data); preparing and caching external data; unstructured & non-committed data; distribute content in band and/or policy based; private cloud: internal DBMS, cache, self data, index, search, metadata; private/public cloud: relational / NoSQL DBMS, cache, self data, index, search, metadata, distributed FS; RT data API; static data API]
Business potential (Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey)
Big Data vs. application domains/sectors
Open Source Big Data Management at Atos (I)
Atos Worldline (HTTS)
- Atos Worldline provides e-commerce solutions to several major e-sellers, in particular to the majority of French mass retail sellers
- One of the main characteristics of this kind of customer is the massive number of users that their e-commerce websites have to handle simultaneously
- For its next e-commerce solution, Massilia, Atos Worldline is investigating the use of NoSQL solutions for some data storage tasks
Open Source Big Data Management at Atos (II)
Atos Research & Innovation
- BUSCAMEDIA: the main objective is to provide a semantic tool to index, access, and locate media content. To perform the automatic categorization and annotation of media content, the technologies applied are based on Google's MapReduce concept, using Hadoop to implement this functionality (http://www.cenitbuscamedia.es/)
- BEinGRID Project: 25 business experiments and a toolset repository of grid middleware upper layers. TAF-GRIDs experiment: detection of telecom fraud, where customers' usage records were exchanged, captured, stored, and analyzed among different telecom operators. The technologies used in the experiment were Globus Toolkit, for computational grid jobs, and OGSA-DAI, for data management (http://www.beingrid.eu)
Atos Worldgrid
- Open Node: definition of a NoSQL storage mechanism for the SmartGrid. In the context of the Open Architecture for Secondary Nodes of the Electricity SmartGrid, a repository is being implemented using an open source Big Data solution. Cassandra has been chosen as the best solution for this implementation.
Atos and the International Olympic Committee
Atos has been the official IT provider for the International Olympic Committee since Barcelona 1992
Flawless IT: on time, on budget
- Designing, building, and operating the critical IT infrastructure and systems that ensure the smooth and secure running of the Olympic Games IT and the immediate availability of results
- Every two years, a new organization
- With 4 billion customers
- Operating 24/7
- Creating a cohesive team of highly trained professionals out of ever-changing partners
- Successfully operating in a new territory
Olympic Games: facts and figures
Technological complexity: demanding data storage and processing
Organizational complexity: the preparatory phase also manages a lot of data
The FI PPP: Positioning in the FI landscape
FI PPP Programme Implementation
FI-WARE: Some figures and data
Main data:
- 26 partners
- 5 universities
- 4248 person-months (excl. open calls)
- Total funding: 41 M€
- Open calls: 12.3 M€
- Total budget: 66.4 M€
- Three years duration
FI-WARE: Objectives
FI-WARE: Collaboration with Usage Areas
FI-WARE and Big Data
The FI-WARE project addresses, among others, the following Data Management topics:
- Publish/Subscribe
- Complex Event Processing (CEP)
- Multimedia analysis
- Unstructured data analysis
- Meta-data pre-processing
- Semantic annotation and application support
- Big Data analysis, based on:
  - Hadoop: heterogeneous MapReduce platform, mainly used for ad-hoc data exploration of large sets of data
  - MongoDB: a scalable, high-performance, open source, document-oriented database
  - SAMSON Platform: high-performance streaming MapReduce platform, used for the near-real-time analysis of streaming data
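To give an idea of what "streaming MapReduce" for near-real-time analysis can mean, the sketch below keeps a running result that is updated as each small batch of records arrives, instead of recomputing over the full data set. It is a generic illustration in plain Python and does not reflect the SAMSON API; the class `StreamingCounter` and the record field `topic` are hypothetical.

```python
from collections import Counter


class StreamingCounter:
    """Keeps running per-key counts that are updated batch by batch."""

    def __init__(self):
        self.totals = Counter()  # state carried over between batches

    def process_batch(self, records):
        # Map: extract the key of each record. Reduce: count keys within the batch.
        batch_counts = Counter(record["topic"] for record in records)
        # Merge the batch result into the running state.
        self.totals.update(batch_counts)
        return dict(self.totals)
```

A usage example: after `process_batch([{"topic": "a"}, {"topic": "b"}])` followed by `process_batch([{"topic": "a"}])`, the running totals hold a count of 2 for "a" and 1 for "b".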
Additional resources
- Project website: http://www.fi-ware.eu/
- The wiki: http://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/
  - The Vision of FI-WARE
  - The FI-WARE Chapters and Respective Technical Content
  - The FI-WARE Agile Development Methodology, Status, and Progress
- FI-WARE High Level Description: public deliverable available at https://forge.fi-ware.eu/
- FI PPP in general: http://www.fi-ppp.eu/
Final Remarks
- Combine a long-term strategy for Big Data research with well-focused short-term projects and experiments centred on:
  - Industrial sectors (public administration, manufacturing, health, retail, etc.)
  - Evaluation of RoI
  - Gathering new requirements (instead of starting with a technology-push approach)
- Look at (and solve) issues such as privacy and security, other regulatory aspects, and economic elements like the suitability of organizational structures
- Define incentives for data usage and added value creation (e.g. http://ourservices.eu/ showcasing open government and open data)
- Collaborate with other initiatives that could bring new resources, e.g. the FI PPP
- Technology transfer to industrial organizations (energy, manufacturing)
Thanks
For more information please contact:
Nuria de Lama, Representative of Atos Research & Innovation to the EC
T +34 91 214 9321
nuria.delama@atosresearch.eu
Atos (Spain), Albarracín 25, E-28037 Madrid
www.atos.net
Atos, the Atos logo, Atos Consulting, Atos Worldline, Atos Sphere, Atos Cloud and Atos WorldGrid are registered trademarks of Atos SA. July 2011. © 2011 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.
Your business technologists. Powering progress
For internal use only