Big Data: Quelques Enjeux Techniques




Big Data: Quelques Enjeux Techniques (Big Data: Some Technical Challenges)
Essai de Typologie des Problèmes de Big Analytics (An Attempted Typology of Big Analytics Problems)
J.F. Marcotorchino, VP, Scientific Director, GBU SIX, Thales Communications & Security

The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. THALES 2011. Template trtp version 7.0.8

Split BIG DATA/BIG ANALYTICS

Definitions

Big Data: all the technologies and techniques that help with scaling:
- Large file storage (virtual)
- Distributed processing (Hadoop) / MapReduce
- NoSQL databases / simple and complex queries

Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:
- Adaptation of ad hoc techniques (statistics, learning) to this environment
- Linear scaling (O(N) or O(N log N) order of magnitude), or suitability for heavy parallelization
- Mandatory linearization, either at the level of the criteria or at the level of the constraint polytopes
- Use of special types of learning techniques based on dimension reduction

The 4 V Challenge (Les 4 V)

- Volume: large storage capacities are now available
  - NAS (Network Attached Storage) type
  - Virtualized storage
  - Cloud computing
- Velocity: large demand for immediate results
  - Stream analytics for SEP/CEP (Stream and Complex Event Processing)
  - In-memory computations adapted to key-value stores
- Variety: large diversity of heterogeneous data types
  - Structured data (classical DB entries) or semi-structured data (images with added metadata)
  - Unstructured data: text, speech, raw images, etc.
- Value: the intrinsic value of the «Data/Information» pair is now recognized by businesses

Some Confusions to Avoid

Do not confuse combinatorial complexity with indexing complexity, that is, the difficulty of the computations with the management of huge data volumes (HPC vs BIG DATA).

- In the first case, it is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve. Example: about 10^29300 solutions (Berend-Tassa estimate, 2010) to explore when clustering a set of N = 10000 objects or individuals. Yet N = 10000 is not a huge amount of data.
- In the second case, it is the amount of data itself that poses a problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
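The order of magnitude quoted above can be checked with a short sketch. It assumes that the slide's estimate refers to the Berend-Tassa upper bound on Bell numbers, B_n < (0.792 n / ln(n + 1))^n, where B_n counts the ways to partition n objects into clusters; the function names are illustrative and not taken from the slides.

```python
import math


def bell_numbers(n_max):
    """Exact Bell numbers B_0..B_n_max, computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]          # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])         # first entry of the new row is the next Bell number
    return bells


def log10_berend_tassa_bound(n):
    """log10 of the assumed upper bound (0.792 * n / ln(n + 1)) ** n on B_n."""
    return n * math.log10(0.792 * n / math.log(n + 1))


if __name__ == "__main__":
    # Number of decimal digits needed to count the clusterings of 10000 objects.
    print(round(log10_berend_tassa_bound(10000)))
```

For N = 10000 the exponent comes out a little above 29300, which agrees with the figure on the slide even though the data set itself is small.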

How to Address Scalability Problems

Scalability by «Linearization» VS Scalability by «Parallelization»

- In the first mode: if the computing time needed for a population of N objects is T, a linear algorithm will take a computing time αT when the population size jumps from N to αN.
- In the second mode: if an algorithm dedicated to a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep the computing time equal to T.

Combining both modes is the best possible approach (when suitable).
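The two modes can be written as a small cost model. This is only a sketch under idealized assumptions (perfectly divisible work, no communication overhead); the function names are hypothetical, and the quadratic case is added purely as a contrast and does not appear on the slide.

```python
def linearized_time(base_time, alpha):
    """First mode: a linear algorithm needs alpha * T when N grows to alpha * N."""
    return alpha * base_time


def parallelized_time(base_time, alpha, machines):
    """Second mode: the alpha * N workload is spread evenly over `machines` workers.

    Assumes ideal parallelism (no communication or coordination cost).
    """
    return linearized_time(base_time, alpha) / machines


def quadratic_time(base_time, alpha):
    """Contrast case (assumption): an O(N^2) algorithm needs alpha^2 * T."""
    return alpha ** 2 * base_time


if __name__ == "__main__":
    T = 10.0
    print(linearized_time(T, 4))        # linear algorithm, one machine
    print(parallelized_time(T, 4, 4))   # same workload on 4 machines
    print(quadratic_time(T, 4))         # why non-linear criteria do not scale
```

With α machines the parallelized time falls back to T, which is the combination of both modes that the slide recommends.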

An Operational Characterization of Big Analytics Methods

Big Data Analytics: «Extended» VS «Intrinsic» cases

«Extended» case:
- NoSQL storage architectures, or the new SQL ones, can be used.
- An exhaustive analysis of the whole data set is not mandatory at all.
- «Analytic Sampling» or «Big Sampling» is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn and attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems).
- The rest of the population, outside the samples, is processed by «inferential segmentation» or by «linear assignment».
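The «Extended» workflow can be sketched as follows: cluster a sample, then label the whole population in one linear pass. This is an illustration only. It assumes that «linear assignment» can be read as assigning each remaining object to its nearest segment centre, and it uses a plain k-means on the sample as a stand-in for whatever segmentation method the author had in mind; all names are hypothetical.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations on a (small) list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        for idx, group in enumerate(groups):
            if group:  # an empty group keeps its previous centroid
                centroids[idx] = tuple(sum(axis) / len(group) for axis in zip(*group))
    return centroids


def segment_by_sampling(population, k, sample_size, seed=0):
    """Cluster a random sample, then label every object in one O(N) pass."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(point, centroids) for point in population]
    return centroids, labels
```

The expensive step only sees `sample_size` objects; the pass over the full population costs one distance computation per object and per segment, so it stays linear in N.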

An Operational Characterization of Big Analytics Methods

Big Data Analytics: «Extended» VS «Intrinsic» cases

«Intrinsic» case:
- It is mandatory to rely on the full data set (exhaustivity), even though avoiding this remains a research topic.
- There is no a priori knowledge, or only partial knowledge, of the population structure.
- Data are stored in NoSQL architectures using the appropriate formats. Examples for graph databases: Neo4j and FlockDB (an open source, distributed, fault-tolerant graph database for managing data at scale, chosen by Twitter).
- To handle the exhaustivity constraint, one has to use heuristics or metaheuristics based on linear iterations, or parallelization through distributed computations.

Some NoSQL DB Types

- Key-value stores: DynamoDB (Amazon)
- Column-oriented databases: examples from Facebook and Google (BigTable)
- Document-oriented databases
- Graph databases: Infinity DB, Neo4j

Complexity grows with E and Rel, where E = number of entities and Rel = average number of relationships per entity.

BIG DATA CONCEPTUAL FOUNDATIONS [Brewer CAP Assignment]

- The three properties: Consistency (C), Availability (A), Partition Tolerance (P).
- It is impossible to satisfy all 3 items: choose 2 (CA, AP or CP).
- Systems shown on the diagram: Voldemort, CouchDB, HBase, MemcacheDB / Berkeley DB.

Some Ideas for Solving Intrinsic Big Analytics Problems

- Use mainly exhaustive methods (if possible, no statistical sampling): Data Driven vs Hypothesis Driven.
- Affinity analysis and sequential patterns (pure linear matchings, scalar products).
- Use classifiers with linear criteria.
- Practice iterative queries, R2I2: Requêtage Récursif Itératif Intelligent (recursive, iterative, intelligent querying), which alternates two techniques: regularized similarity and clustering «on the fly».
- Unsupervised clustering with no a priori (extending «No K-Means» approaches using linear relational criteria).
- Text mining (word spotting).
- Reticular data analysis (social networks, huge IT networks): routing procedures, modularizations, dynamic topology.

This document may not be reproduced, modified, adapted, published or translated in any way, in whole or in part, nor disclosed to a third party, without the prior written consent of Thales. THALES 2012. All rights reserved. Template trtp version 7.1.0
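To illustrate the affinity analysis item in the list above ("pure linear matchings, scalar products"), here is a minimal sketch in which every item is encoded as a 0/1 vector over the transactions, so that the co-occurrence count of two items is simply the scalar product of their vectors. The encoding and the function names are assumptions made for illustration; the slides do not specify an implementation.

```python
from itertools import combinations


def item_vectors(transactions):
    """Map each item to a 0/1 vector with one coordinate per transaction."""
    items = sorted({item for basket in transactions for item in basket})
    return {
        item: [1 if item in basket else 0 for basket in transactions]
        for item in items
    }


def dot(u, v):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def co_occurrences(transactions):
    """Co-occurrence count of every item pair, obtained by scalar products."""
    vectors = item_vectors(transactions)
    return {
        (a, b): dot(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    for pair, count in co_occurrences(baskets).items():
        print(pair, count)
```

Each scalar product is linear in the number of transactions, and the pairs are independent of one another, so the computation can be distributed pair by pair.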

BIG ANALYTICS TYPOLOGY

Tentative Structuring of Big Analytics Approaches

[Figure: methods positioned along two axes, «Level of Problem Complexity» and «Lack of Population Knowledge». Labels on the chart: Learning & Neural Nets; Self Encoded and Hourglass Shaped Neural Nets; Reticular Data Structuring; Social Networks Communities Detection; MDL Learning Models; Learning Model for Unsupervised Classification; Limited Layers Neural Nets; Supervised Rule Based Classification; BiClass SVM; Naïve Bayes Networks; Multi Classes SVM; MOLAP and XOLAP; Classical BI Data Mining; Image & Video Analytics; Unsupervised Clustering; Reticular Visual Analytics; Parallel Coordinates; Large Networks Topological Design; Faces & Pattern Recognition; Piecewise Linear Regression; Sequential Patterns Recognition & Affinity Analysis; Vector Matching; Structuring.]

An Example of an Intrinsic Big Analytics Problem: Graph Modularity

Example: Krebs graph on American politics, S. Mandal (MIT). Node classes: Liberal, Centrist, Conservative.

Girvan-Newman's quadratic formulation: the modularity of a network is the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random («deviation to independence»).

Maximizing modularity rigorously may be NP-hard, hence the use of heuristic approaches.

MIT heuristic algorithm:
1. Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector.
2. Partition the network into two parts based on the signs of the elements of that eigenvector.
3. Repeat for each part.
4. If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it.
5. When the entire graph consists of indivisible subgraphs, stop.

Typical running time: O(N^2 log N) for a sparse graph.
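The first two steps of the heuristic can be sketched in plain Python on a toy graph. This is a hedged illustration, not the implementation behind the slide: it performs a single bisection only (no recursion over the parts), it finds the leading eigenvector by shifted power iteration, and the toy graph (two triangles joined by one edge) and all function names are assumptions.

```python
# Toy graph (assumption): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
TWO_TRIANGLES = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def modularity_matrix(adjacency):
    """B[i][j] = A[i][j] - k_i * k_j / (2m), with k the degrees and m the edge count."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    n = len(adjacency)
    return [
        [adjacency[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + shift * I); the shift makes all eigenvalues >= 0."""
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)  # Gershgorin bound
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * vector[j] for j in range(n)) + shift * vector[i]
            for i in range(n)
        ]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector


def bisect(adjacency):
    """Split the nodes into two groups by the signs of the leading eigenvector."""
    vector = leading_eigenvector(modularity_matrix(adjacency))
    return [1 if x >= 0 else -1 for x in vector]


def modularity(adjacency, signs):
    """Q = s^T B s / (4m) for a two-group split encoded as +1 / -1."""
    b = modularity_matrix(adjacency)
    two_m = sum(sum(row) for row in adjacency)
    n = len(adjacency)
    total = sum(b[i][j] * signs[i] * signs[j] for i in range(n) for j in range(n))
    return total / (2 * two_m)


if __name__ == "__main__":
    split = bisect(TWO_TRIANGLES)
    print(split, modularity(TWO_TRIANGLES, split))
```

On this graph the sign pattern separates the two triangles. A full version would apply `bisect` recursively to each part and stop whenever a proposed split no longer increases modularity, as in steps 3 to 5.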

By relational transform we turn the criterion into a linear function subject to linear constraints:

$X_{ij} - X_{ji} = 0 \quad \forall (i,j)$ (Symmetry)
$X_{ii} = 1 \quad \forall i$ (Reflexivity)
$X_{ij} + X_{jk} - X_{ik} \le 1 \quad \forall (i,j,k)$ (Transitivity)
$X_{ij} \in \{0,1\}$ (Binarity)

Idea: relying on the locally linear «Louvain» algorithm (Blondel-Guillaume, Univ Louvain / UPMC LIP6), use the linear relational form: O(N log N).

We can do more: because the Louvain algorithm is generic, we can use linear criteria that are better than Girvan-Newman's and that are justified by Optimal Transport, e.g. «deviation to indetermination» (Patricia Conde-Céspedes).
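The relational encoding can be made concrete with a short sketch. It builds the 0/1 matrix X from cluster labels (X[i][j] = 1 when i and j share a cluster), checks the four constraints, and evaluates modularity as a function that is linear in X. The modularity formula used is the standard Newman-Girvan one, Q = (1/2m) Σ (A_ij - k_i k_j / 2m) X_ij, which is an assumption about the slide's exact notation; the function names are hypothetical.

```python
from itertools import product


def relation_from_labels(labels):
    """X[i][j] = 1 if objects i and j carry the same cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def is_equivalence_relation(x):
    """Check binarity, symmetry, reflexivity and transitivity of X."""
    idx = range(len(x))
    binary = all(x[i][j] in (0, 1) for i in idx for j in idx)
    symmetric = all(x[i][j] - x[j][i] == 0 for i in idx for j in idx)
    reflexive = all(x[i][i] == 1 for i in idx)
    transitive = all(
        x[i][j] + x[j][k] - x[i][k] <= 1 for i, j, k in product(idx, repeat=3)
    )
    return binary and symmetric and reflexive and transitive


def linear_modularity(adjacency, x):
    """Modularity written as a linear function of the relational matrix X."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    idx = range(len(adjacency))
    total = sum(
        (adjacency[i][j] - degrees[i] * degrees[j] / two_m) * x[i][j]
        for i in idx
        for j in idx
    )
    return total / two_m
```

Since the coefficients (A_ij - k_i k_j / 2m) do not depend on X, moving one object between clusters changes the criterion by a sum over that object's row only, which is what makes a local-move scheme of the Louvain type cheap to iterate.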

Big Analytics: Some Topics of Interest

Big Analytics for Cyber-Security:
- Components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, log analytics, cyber intelligence)
- Attack detection from relational and content data, intelligent IDS and sandbox coupling, intelligent coupling with IS passive and dynamic mapping
- Big Data platform for log analytics, visual analytics

Big Analytics for Smart Transport:
- Business Analytics Web portal for understanding passenger behaviour and profiles, traffic anomaly detection
- New components and use cases focused on mobility
- Approach based on space-time queries, BI, early warning engine, Big Analytics and optimization techniques for the Smart City
- Fraud detection

Big Analytics for National Security:
- Social Web Intelligence for National Security: cyber-infringement detection and investigation
- SNA: social mining, crisis management
- Maritime security: predictive analysis and anomaly detection
- E-border: Big Analytics on passenger logs

Big Analytics for maintenance:
- Applications to vehicle, radar, weapon systems, transport
- HUMS (Health & Usage Monitoring Systems)

Big Analytics Innovation Trends at Medium-Range Horizon

- Coupling auto-encoder neural nets with predictive modeling for feature extraction
- Opening «Data Streaming Processing» (real time) to more sophisticated and powerful analytical tools: towards real-life CEP
- Coupling «Genetic Algorithms» with «Relational linear transforms»: linearization procedures
- In network analysis, addressing the complexity of dynamic graph modeling: dynamic modularization