
Transcription

1 Planning
Time | Concept | Subject | Speaker
9:05 | Keynote 1 | Big Data Processing | Sylvain LEFEBVRE
9:45 | Keynote 2 | Smart Cities | Gilles BETIS
10:30 | Break
11:00 | Keynote 3 | Distributed Intelligence in IoT | Philippe GAUTIER
11:45 | Keynote 4 | Data Protection | Anne BARBIER GOLIRO
12:30 | Lunch
14:30 | Workshop W | Graph Databases | Cedric FAUVET
17:30 | Cocktail

2 Introduction to Big Data Processing Raja Chiky

3 About RDI Team
R. Chiky: Associate professor in Computer Science, LISITE-RDI
Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
Research field: large-scale data management (heterogeneous and static data; heterogeneous and dynamic data streams)
1. Real-time and distributed processing of various data sources
2. Use of semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining sensors
4. Optimizing resources in large-scale systems

4 Goal of this talk: Recognise some of the main terminology. Remember that there are many tools available. Focus on Hadoop, the most popular open-source Big Data ecosystem. Realise the potential of Big Data. 26/11/2015

5 CONTENT: What is Big Data? Data Streaming. NoSQL databases. Distributed File System. MapReduce paradigm. Visualization.

6 Big Data: Buzzword!

7 What is Big Data?
[Figure: volume of data created worldwide since the dawn of time, in EB and ZB]
Units: 1 GB = 10^9 bytes, 1 TB = 10^12 bytes, 1 PB = 10^15 bytes, 1 EB = 10^18 bytes, 1 ZB = 10^21 bytes, 1 YB = 10^24 bytes
Big Data elements: Volume, Velocity, Variety + Veracity (IBM): information uncertainty
Variety of data: radio, TV, news, e-mails, Facebook posts, tweets, blogs, photos, videos (user and paid), RSS feeds, Wikipedia, GPS data, RFID, POS scanners
Velocity of data: Walmart handles 1M transactions per hour. Google processes 24 PB of data per day. AT&T transfers 30 PB of data per day. 90 trillion e-mails are sent per year. World of Warcraft uses 1.3 PB of storage. Facebook, with a user base of 900M users, had 25 PB of compressed data. 400M tweets per day in June. Hours of video are uploaded to YouTube every minute.
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla

8 What is Big Data?
Gartner definition: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
McKinsey definition: A dataset whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

9 Key factors
Cheap storage: recording everything is not expensive anymore.
Cloud computing: cheap, on-demand computing resources from anywhere in the world and for everyone.
Business reasons: new insights arise that give competitive advantage.
Data in various forms everywhere: IoT and IoE, social networks, Open Data.
The way we interact with each other and with data / information.

10 World of data: website logs, network monitoring, financial services, weather forecasting, power consumption, e-commerce, traffic control. Data may come from humans, sensors or machines.

11 Transforming our daily lives. Then: one size fits all. Now: personalization & targeted selling. Source: Big Data Trends by David Feinleib

12 Fitness. Then: manual tracking. Now: focus on the goal. Source: Big Data Trends by David Feinleib

13 Big Data workflow: Capture, Store, Analyze, Visualize. Challenges arise in all these steps. 16/01/2014

14 Challenges: Data Collection
Heterogeneity of sources: company databases (=> silos); sensor networks, intelligent objects; data streams (social networks, financial information, etc.)
Data velocity
Data provenance and quality
Security / privacy

15 Type of data used in Big Data initiatives: internal data, traditional sources, «new data». Source: Big Data opportunities survey, Unisphere / SAP, May

16 Challenges: Data Collection, Velocity

17 What is a data stream? Golab & Özsu (2003): A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety. Massive volumes of data: items arrive at a high rate.

18 Data Stream Management Systems: DBMS vs DSMS
Data model. DBMS: permanent updatable relations. DSMS: streams and permanent updatable relations.
Storage. DBMS: data is stored on disk. DSMS: permanent relations are stored on disk; streams are processed on the fly.
Query. DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries). DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries).
Performance. DBMS: large volumes of data. DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing.
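The "extended SQL on streams with windowing" and "continuous queries" entries above can be illustrated with a small Python sketch. It is not taken from any particular DSMS: the function name `sliding_window_avg`, the (timestamp, value) item format and the 3-second window are assumptions chosen for the illustration. The query emits a fresh result for every arriving item and keeps only the items that fall inside the time window, so the stream is never stored in its entirety.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_window_avg(
    stream: Iterable[Tuple[float, float]], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Continuous query: for each arriving (timestamp, value) item, emit the
    average of the values received during the last `window_seconds`."""
    window = deque()  # items currently inside the time window
    total = 0.0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        # Evict items that have fallen out of the window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


# Hypothetical sensor readings: (timestamp in seconds, measured value).
readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
results = list(sliding_window_avg(readings, window_seconds=3))
```

Because the function is a generator, it can be fed an unbounded source just as well as the small list used here; memory use depends on the window size, not on the length of the stream.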

19 Too many data streams but not enough knowledge

20 Semantic Web technologies for data streams
Annotate stream data with semantic metadata.
Apply Linked Data principles to publish streaming data.
Interlink streaming data with existing datasets.
Integrate data stream processing + reasoning.
Objectives: interoperability, automation, enrichment.

21 Challenges in data storage
Large amounts of data: need to use a highly distributed architecture.
Massive queries: avoid joins, since they are very time consuming.
Evolutionary schema: flexibility and scalability.
Predictable and low latency.
High availability.
Elasticity: horizontal extensibility.
No need for: transactions, strong consistency, complex queries.

22 Limitations of RDBMS. "If the only tool you have is a hammer, you tend to see every problem as a nail." (Abraham Maslow)

23 Limitations of RDBMS

24 Limitations of RDBMS
Relational DBMS offer: join operators between tables to build complex queries involving several entities; integrity constraints; transaction management; ACID properties.
In a highly distributed environment, these mechanisms have a significant cost: with most RDBMS, data are on one machine (one node), and it is difficult to place the data on different nodes.

25 NoSQL: Not Only SQL / Non-relational

26 NoSQL?
NoSQL => Not Only SQL. SQL must not die, but other storage solutions should be considered for specific applications (especially web applications).
Exact name: non-relational DB.
The ACID model (Atomicity, Consistency, Isolation, Durability) hinders scalability in a distributed environment, for example by limiting write speed (writes being the most expensive operations).
4 NOs: 1) NO SCHEMA (schema free) 2) NO JOIN (extract data without joins) 3) NO DATA FORMAT (graph, document, row, column) 4) NO ACID transactions

27 Scalability
Scalability is the ability of a system, network, or process to handle a growing amount of work, or its ability to be enlarged to accommodate that growth. Two kinds:
Vertical scalability (scale up): add more powerful servers; costly; easy to set up.
Horizontal scalability (scale out): add ordinary machines (commodity hardware); less expensive; more complex to set up (due to load balancing concerns).

28 Sharding (data partitioning)
A shard (partition) is a logical division of a database into several independent parts. This allows us to obtain a storage capacity greater than the limited capacity of a single hard disk, or to perform queries in parallel on multiple partitions. Two kinds:
Vertical sharding: each node stores one (or more) table(s) of a database.
Horizontal sharding: each node stores a subset (identified by a key range) of a data table.
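Horizontal sharding by key range can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `RangeShardedTable`, the split points "h" and "p", and the in-memory dictionaries standing in for nodes are all hypothetical, and a real system would also handle rebalancing and replication.

```python
import bisect


class RangeShardedTable:
    """Toy table whose rows are spread over shards by key range."""

    def __init__(self, upper_bounds):
        # upper_bounds[i] is the exclusive upper key bound of shard i;
        # the last shard receives every key beyond the final bound.
        self.upper_bounds = sorted(upper_bounds)
        self.shards = [dict() for _ in range(len(self.upper_bounds) + 1)]

    def shard_for(self, key):
        """Return the index of the shard responsible for `key`."""
        return bisect.bisect_right(self.upper_bounds, key)

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        # Only one shard is contacted, whatever the size of the table.
        return self.shards[self.shard_for(key)].get(key)


# Three shards: keys below "h", keys from "h" up to (not including) "p", the rest.
table = RangeShardedTable(["h", "p"])
table.put("amy", {"city": "Paris"})
table.put("kim", {"city": "Lyon"})
table.put("zoe", {"city": "Lille"})
```

A lookup touches a single shard, which is what lets queries on different key ranges run in parallel on different nodes.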

29 CAP theorem (E. Brewer, N. Lynch 2000)
C: Consistency. A: Availability. P: Partition-Tolerance.
CAP theorem: C-A-P, choose two. Claim: every distributed system is on one side of the triangle.
CA: available and consistent, unless there is a partition.
CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others.
AP: a reachable replica provides service even in a partition, but may be inconsistent.

30 NoSQL taxonomy (data models): key-value, graph, column, document

31 Presentation of HDFS
HDFS is the Hadoop distributed file system. HDFS was inspired by the Google File System (GFS): The Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, Google.
Historically, it is composed of 2 main kinds of nodes:
The NameNode is in charge of the metadata management (one NameNode per cluster).
The DataNode is in charge of the data storage (one DataNode per machine).
Each file in HDFS is split into blocks (the block size is 64 MB by default in Hadoop 1.x, 128 MB since Hadoop 2.x).
Each block is replicated on different DataNodes (3 replicas by default). The replication mechanism is important for both fault tolerance and data availability.
[Figure: a file split into blocks, replicated across DataNodes and tracked by the NameNode]
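The block splitting and replication described above can be made concrete with a short Python sketch. The constants come from the slide (64 MB blocks, 3 replicas); the function names, the DataNode names `dn1` to `dn4` and the round-robin placement are assumptions made for the illustration, and the placement in particular is not the rack-aware policy that HDFS actually applies.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default quoted on the slide
REPLICATION = 3                # 3 replicas by default, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is an assumption made for this sketch; it is not
    the rack-aware policy that HDFS actually applies.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)  # three full blocks and one 8 MB block
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(blocks) * REPLICATION // MB  # disk space used across the cluster
```

With these numbers a 200 MB file occupies four blocks and 600 MB of raw disk space, and the loss of any single DataNode still leaves two copies of every block.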

32 Challenges in Data Analytics
Problems in large-scale analytics.
Distributed computation efficiency: evaluate performance gains from distribution; bringing data to the processor; efficient parallel algorithms (statistics, summaries).
Speed analytics: streaming computations; streaming languages and libraries; load balancing.

33 Big Data: Technological challenges
Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, Hadoop/MapReduce ecosystem.
New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI.
Data processing: supercomputers, distributed or massively parallel computing.

34 MapReduce
Introduced by Google in 2004.
Aim: to parallelize processing (indexing, data mining, ...) by spreading the task of processing data across machines.
MapReduce offers: easy parallelization and distribution of processing; fault tolerance; load balancing; abstraction for programmers.

35 Programming model
Based on functional programming languages. Developers implement two functions:
Map: in the map phase, data is distributed to a number of machines and processed in the form of key/value pairs. The output is partitioned (sorted) by key.
Reduce: for each key group, the values sharing the same key are aggregated (reduced) to form the final result.

36 MapReduce (Global architecture)

37 Example Word Count
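The word count example can be sketched in plain Python as follows. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code: the function names and the two sample documents are assumptions, and in a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word of one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all the values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(groups.items()))


counts = word_count(["big data is big", "data streams"])
```

Only `map_phase` and `reduce_phase` correspond to what a developer writes; the grouping done here by `shuffle` is the part a MapReduce framework provides.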

38 Moving computation to data
Data is stored in distributed file systems (e.g. Google File System, Hadoop Distributed File System).
Large blocks (a.k.a. chunks), usually of 64 MB. Chunks are replicated and distributed over machines. The map function runs on each of the chunks.
A master node knows the data locality: it receives jobs, computes the necessary map and reduce tasks, and selects and activates worker nodes (chosen, if possible, close to the data).
Many MapReduce implementations exist: C#, C++, Erlang, Java, Python, etc. Apache Hadoop MapReduce is arguably the most prominent one.

39 Hadoop
Open-source project (Apache Software Foundation), written in Java.
Software platform that lets one easily write and run applications that process vast amounts of data across a cluster of machines.
It includes: MapReduce, HDFS, HBase.
Yahoo! is the biggest contributor.
Powered by Hadoop (http://wiki.apache.org/hadoop/poweredby): Amazon, Apple, eBay, IBM, Google, Microsoft, SAP, Twitter, etc.

40 Hadoop Ecosystem
Avro: remote procedure call and data serialization
Flume: harvesting, aggregating and moving large amounts of log data in and out of Hadoop
HBase: column-oriented DB
Hive: warehouse structure and SQL-like access for data in HDFS
Pig: high-level scripting language (Pig Latin) for querying
Sqoop: SQL To Hadoop (data import from RDBMS to HDFS)
Oozie: job coordinator and workflow manager
Hue: graphical interface for Hadoop
Chukwa: large-scale log collection and analysis

41 Problems with MapReduce
MapReduce's main strength is its simplicity, as long as the application does not require complex SQL queries.
MapReduce is fairly low-level: one must think about keys, values, partitioning, etc.
All data and intermediate data are written to disk!
All standard database operations (join, selection, projection, etc.) must be coded by hand.
Solution: use high-level languages that translate programs to MapReduce automatically (Pig, Hive, etc.).

42 Alternative to MapReduce: Pig
Implementation started at Yahoo! Research. Executes more than 30% of Yahoo!'s jobs.
Features: expresses sequences of MapReduce jobs; provides relational (SQL) operators (JOIN, GROUP BY, etc.).

43 Pig - example
Given two tables, Visits and url-info, find the top 10 most visited pages in each category.
Visits:
User | Url | Time
Amy | cnn.com | 8:00
Amy | bbc.com | 10:00
Amy | flickr.com | 10:05
Fred | cnn.com | 12:00
url-info:
Url | Category | PageRank
cnn.com | News | 0.9
bbc.com | News | 0.8
flickr.com | Photos | 0.7
espn.com | Sports | 0.9
Dataflow: Load Visits; Group by url; Foreach url generate count; Load Url Info; Join on url; Group by category; Foreach category generate top10 urls.

44 in MapReduce

45 In Pig
  visits = load '/data/visits' as (user, url, time);
  gvisits = group visits by url;
  visitcounts = foreach gvisits generate url, count(visits);
  urlinfo = load '/data/urlinfo' as (url, category, prank);
  visitcounts = join visitcounts by url, urlinfo by url;
  gcategories = group visitcounts by category;
  topurls = foreach gcategories generate top(visitcounts,10);
  store topurls into '/data/topurls';
* top, count are user defined functions
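To make the dataflow of the Pig script easier to follow, here is the same sequence of steps as a plain Python sketch on the sample rows of the Visits and url-info tables. It runs in a single process and only mirrors the logic (group, count, join, group, top N); the function name `top_urls_per_category` and the tie-breaking by URL are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

# Sample rows from the Visits and url-info tables of the example.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = {
    "cnn.com": ("News", 0.9),
    "bbc.com": ("News", 0.8),
    "flickr.com": ("Photos", 0.7),
    "espn.com": ("Sports", 0.9),
}


def top_urls_per_category(visits, url_info, n=10):
    # group visits by url + foreach ... generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join on url, then group by category
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in url_info:  # inner join: URLs without metadata are dropped
            category, _pagerank = url_info[url]
            by_category[category].append((url, count))
    # foreach category generate top n (ties broken by URL, an assumption)
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }


top_urls = top_urls_per_category(visits, url_info)
```

As in an inner join, espn.com has no visits and therefore does not appear in the result, so no Sports category is produced for this sample.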

46 Alternative to MapReduce: Hive
Data warehouse on top of Hadoop. Open source, written in Java.
Metadata is stored in relational databases.
Provides an SQL-like query language: HiveQL.
Provides the possibility to create user defined functions.
Indexing.

47 Language - DDL
Create a table
  hive> CREATE TABLE customer (age INT, address STRING);
Display tables
  hive> SHOW TABLES;
Describe tables
  hive> DESCRIBE customer;
Modify a table
  hive> ALTER TABLE customer ADD COLUMNS (age INT);
Delete a table
  hive> DROP TABLE customer;

48 Language - DML
Load a file
  hive> LOAD DATA LOCAL INPATH '/data/home/test.txt' OVERWRITE INTO TABLE customer;
HiveQL
  hive> SELECT c.age FROM customer c WHERE c.sdate= ;
  hive> INSERT OVERWRITE DIRECTORY '/data/hdfs_file' SELECT c.* FROM customer c WHERE c.sdate= ;
ODBC can be used to connect external BI tools.

49 Next gen: Spark

50 A richer set of operators
Resilient Distributed Dataset: immutable data tables; lazy transformations; some can be applied equally to streaming data.
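The two properties named above, immutability and lazy transformations, can be illustrated with a toy Python class. This is deliberately not the Spark API: `LazyDataset` and its methods are hypothetical names for a single-machine stand-in, with no distribution or fault tolerance. Transformations only record what has to be done and return a new object; the work happens when `collect` is called.

```python
class LazyDataset:
    """Toy, single-machine stand-in for an RDD-like dataset (not the Spark API)."""

    def __init__(self, source, ops=()):
        self._source = source   # the underlying data is never modified
        self._ops = tuple(ops)  # transformations recorded so far

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def collect(self):
        """Action: run the recorded transformations and return the results."""
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records every invocation of square, to make laziness visible


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_collect = list(calls)  # still empty: nothing has run yet
result = even_squares.collect()     # the recorded pipeline runs here
```

Because `numbers` is never modified, it can be reused as the starting point of other pipelines after `even_squares` has been computed.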

51 Lambda architecture (by Nathan Marz)
Generic, scalable and fault-tolerant data processing architecture.
[Figure: the data flow feeds the batch layer (batch processing) and the speed layer (real-time stream processing); the serving layer exposes precomputed views to queries]
Source: Mathieu DESPRIEE (USI)

52 Lambda architecture
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
The serving layer indexes the batch views so that they can be queried in an ad hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.
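The interplay of the three layers can be sketched with a small page-view counter in Python. The class `LambdaCounter` and its method names are hypothetical, everything runs in one process, and the sketch simplifies one point: it assumes no event arrives while a batch run is in progress, so the real-time view can simply be reset once the batch view has caught up.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter organised along the three Lambda layers."""

    def __init__(self):
        self.master_dataset = []       # batch layer: append-only raw events
        self.batch_view = Counter()    # serving layer: precomputed view
        self.realtime_view = Counter() # speed layer: recent events only

    def ingest(self, page):
        # Every event is dispatched to both the batch and the speed layer.
        self.master_dataset.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        # Recompute the batch view from the whole master dataset.
        self.batch_view = Counter(self.master_dataset)
        # Simplification: the batch view now covers everything seen so far,
        # so the real-time view restarts from empty.
        self.realtime_view = Counter()

    def query(self, page):
        # A query merges the batch view with the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


views = LambdaCounter()
for page in ["cnn.com", "cnn.com", "bbc.com"]:
    views.ingest(page)
before_batch = views.query("cnn.com")  # answered by the speed layer alone
views.run_batch()
views.ingest("cnn.com")                # arrives after the batch run
after_batch = views.query("cnn.com")   # batch view + real-time view
```

The query result is the same whether an event has already been absorbed by a batch run or is still held only by the speed layer, which is the point of merging the two views.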

53 Big Data Stream Mining
Platform | Machine learning
Distributed, batch: Hadoop | Mahout
Distributed, stream: S4, Storm | SAMOA
Non-distributed, batch: R, WEKA
Non-distributed, stream: MOA

54 Challenges in Data Access and Visualization
The main goal of data visualization is to communicate information clearly and effectively through graphical means.
Provide results of the analytics workflow for faster systems such as real-time query interfaces.
"Visualization is a form of knowledge compression" (David McCandless)

55 Conclusion: Big Data challenges
Semantic information aggregation: too much data to assimilate but not enough knowledge to act.
Distributed and real-time processing: design of real-time and distributed algorithms for stream processing and information aggregation; distribution and parallelization of data mining algorithms.
Visual analytics and user modeling: dynamic user model; novel visualizations for very large datasets.
Data privacy: Big Data is often generated by people; obtaining consent is often impossible and anonymisation is very hard.

56 Thanks to Marie-Aude Aufaure (ECP) and Sylvain Lefebvre (ISEP)