Big Data: Some Technical Challenges

An Attempt at a Typology of Big Analytics Problems
J.F. Marcotorchino, VP, Scientific Director, GBU SIX, Thales Communications & Security

The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. THALES 2011. Template trtp version 7.0.8

2 / Split BIG DATA/BIG ANALYTICS

3 / Definitions

Big Data: all the technologies and techniques that help with scaling:
- Large file storage (virtual)
- Distributed processing (Hadoop) / Map-Reduce
- NoSQL databases / simple and complex queries

Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:
- Adaptation of ad hoc techniques (statistics, learning) to this environment
- Linear scaling (O(N) or O(N log N) order of magnitude), or amenability to heavy parallelization
- Linearization is mandatory, either at the criteria level or at the level of the constraint polytopes
- Use of special types of learning techniques through dimension reduction

4 / The 4 V Challenge

Volume: large storage capacities are now available
- NAS type (Network Attached Storage): virtualized storage
- Cloud computing

Velocity: large demand for immediate results
- Stream analytics for SEP/CEP (Stream & Complex Event Processing)
- In-memory computations adapted to key-value stores

Variety: large diversity of heterogeneous data types
- Structured data (classical DB entries) or semi-structured data (images with added metadata)
- Unstructured data: text, speech, raw images, etc.

Value: the intrinsic value of the «Data/Information» pair is now recognized by businesses

5 / Some Confusions to Avoid

Do not confuse combinatorial complexity with indexing complexity, that is, the difficulty of the computations themselves with the management of huge data volumes (HPC vs Big Data).
- In the first case, it is not the data amount per se that is the drawback, but the intrinsic combinatorial structure of the problem to solve. Example: about 10^29300 solutions (Berend-Tassa estimate, 2010) to explore for clustering a set of N = 10000 objects or individuals. Yet N = 10000 is not a huge amount.
- In the second case, it is the data amount itself that poses a problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
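The order of magnitude quoted for N = 10000 can be checked with a short sketch. It assumes that the "Berend-Tassa estimate" is the published upper bound B_n < (0.792 n / ln(n+1))^n on the Bell number B_n (the number of partitions of n objects); the function names are illustrative and not taken from the slides.

```python
from math import log, log10


def bell_number(n):
    """Exact Bell number B_n (number of partitions of n objects), via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def log10_berend_tassa_bound(n):
    """log10 of the upper bound (0.792 * n / ln(n + 1)) ** n on B_n (assumed form of the estimate)."""
    return n * log10(0.792 * n / log(n + 1))


if __name__ == "__main__":
    # Exponent of 10 in the bound for N = 10000 objects.
    print(round(log10_berend_tassa_bound(10000)))
    # For small n the exact count can still be enumerated and compared with the bound.
    for n in (5, 10, 20):
        print(n, bell_number(n), log10_berend_tassa_bound(n))
```

The exact Bell triangle is only practical for small n; for N = 10000 the bound alone already shows that exhaustive exploration is out of reach, whatever the storage architecture.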

6 / How to Address Scalability Problems

Scalability by «Linearization» vs scalability by «Parallelization»
- In the first mode: if the computing time needed for a population of N objects is T, a linear algorithm will take a computing time αT when the population size jumps from N to αN.
- In the second mode: if an algorithm dedicated to a population of size N can be processed on a single machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep the computing time equal to T.

Combining both modes is the best possible approach (when suitable).
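The second mode can be illustrated with a minimal sketch: the data set of size αN is cut into α chunks of size about N, each chunk is handled by one worker with a linear-time routine, and the partial results are merged. The sum of squares is only a stand-in for a linear criterion, the function names are hypothetical, and threads stand in for the α machines (in CPython they do not give real CPU parallelism, so this shows the decomposition rather than the speed-up).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Linear-time work on one chunk: time grows proportionally to len(chunk)."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, workers):
    """Map the linear routine over `workers` chunks, then reduce the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, split(data, workers)))
    return sum(partials)
```

With `workers` equal to α, each worker sees a chunk of size about N, which is the situation described in the second mode; the routine being linear is what makes the per-chunk cost predictable.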

An Operational Characterization of Big Analytics Methods

Big Data Analytics: «Extended» vs «Intrinsic» cases

«Extended» case:
- Possible use of NoSQL storage architectures, or new SQL ones
- Exhaustive analysis of the whole data set is not mandatory at all
- «Analytic Sampling» or «Big Sampling» is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn and attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems)
- The rest of the population, outside the «samples», is processed by «inferential segmentation» or by «linear assignment»
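A minimal sketch of the «Extended» workflow, under the assumption that «linear assignment» can be read as a single pass that attaches each remaining individual to the nearest segment found on the sample. The segmentation of the sample itself is taken as given, and all names are illustrative rather than taken from the slides.

```python
def centroids(sample_points, sample_segments):
    """Mean point of each segment, computed on the already segmented sample."""
    sums, counts = {}, {}
    for point, segment in zip(sample_points, sample_segments):
        acc = sums.setdefault(segment, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[segment] = counts.get(segment, 0) + 1
    return {seg: [v / counts[seg] for v in acc] for seg, acc in sums.items()}


def nearest_segment(point, segment_centroids):
    """Segment whose centroid has the smallest squared Euclidean distance to the point."""
    return min(
        segment_centroids,
        key=lambda seg: sum((p - c) ** 2 for p, c in zip(point, segment_centroids[seg])),
    )


def assign_remainder(points, segment_centroids):
    """One pass over the non-sampled population: cost linear in the number of points."""
    return [nearest_segment(point, segment_centroids) for point in points]
```

The expensive analysis is confined to the sample; the remainder costs one distance computation per individual and per segment, which keeps the overall procedure linear in the population size.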

An Operational Characterization of Big Analytics Methods

Big Data Analytics: «Extended» vs «Intrinsic» cases

«Intrinsic» case:
- It is mandatory to rely on the full data set (exhaustivity); how to avoid doing so remains a research topic
- No a priori knowledge, or only partial knowledge, of the population structure
- Data are stored in NoSQL architectures using the appropriate formats (for graph databases, for example: Neo4j, or FlockDB, an open-source, distributed, fault-tolerant graph database for managing data at scale, chosen by Twitter)
- To handle the exhaustivity constraint, one must use heuristics or meta-heuristics based on linear iterations, or parallelization through distributed computations

Some NoSQL DB Types

- Key-value stores: DynamoDB (Amazon)
- Column-oriented DB: [name missing] (Facebook), BigTable (Google)
- Document-oriented DB
- Graph databases: Neo4j
- Also shown on the slide: Infinity DB

Complexity grows like E Rel, where E = number of entities and Rel = average number of relationships per entity.

BIG DATA CONCEPTUAL FOUNDATIONS [Brewer's CAP theorem]

It is impossible to satisfy all 3 properties at once: choose 2.
- Consistency
- Availability
- Partition tolerance

Possible pairs: CA, AP, CP
Systems placed on the diagram: Voldemort, CouchDB, HBase, MemcacheDB / Berkeley DB

Some Ideas for Solving Intrinsic Big Analytics Problems

- Use mainly exhaustive methods (no statistical sampling if possible) (data driven vs hypothesis driven)
- Affinity analysis & sequential patterns (pure linear matchings, scalar products)
- Use classifiers with linear criteria
- Practice iterative queries: R2I2, Intelligent Iterative Recursive Querying (two techniques applied in alternation: regularized similarity + «on the fly» clustering)
- Unsupervised clustering (no a priori) (extending «No K-Means» approaches using linear relational criteria)
- Text mining (word spotting)
- Reticular data analysis (social networks, huge IT networks): routing procedures, modularizations, dynamic topology
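As an illustration of the «scalar products» remark on affinity analysis, the sketch below counts item co-occurrences in one pass over the transactions and checks that a count equals the scalar product of the two binary indicator vectors. This is one possible reading of the bullet, not a method stated on the slides; the function names and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(baskets):
    """One pass over the transactions; counts[(a, b)] = number of baskets containing both a and b."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def indicator(item, baskets):
    """Binary vector with one coordinate per basket: 1 if the item is present, else 0."""
    return [1 if item in basket else 0 for basket in baskets]


def scalar_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))
```

The pass is linear in the number of transactions (for bounded basket sizes), and each affinity value is a plain scalar product, which is what makes this family of analyses easy to distribute.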

12 / BIG ANALYTICS TYPOLOGY

Tentative Structuring of Big Analytics Approaches

[Figure: approaches positioned along two axes, «Level of Problem Complexity» and «Lack of Population Knowledge». Labels on the chart, in extraction order:]
- Learning & Neural Nets
- Self Encoded and Hourglass Shaped Neural Nets
- Reticular Data Structuring
- Social Networks Communities Detection
- MDL Learning Models
- Learning Model for Unsupervised Classif
- Limited Layers Neural Nets
- Supervised Rule Based Classification
- BiClass SVM
- Naïve Bayes Networks
- Multi Classes SVM
- MOLAP and XOLAP
- Classical BI
- Data Mining
- Image & Video Analytics
- Unsupervised Clustering
- Reticular Visual Analytics
- Parallel Coordinates
- Large Networks Topological Design
- Faces & Pattern Recognition
- Piecewise Linear Regression
- Sequential Patterns Recognition & Affinity Analysis
- Vector Matching Structuring

An Example of Intrinsic Big Analytics Problem: Graph Modularity

[Figure: Krebs graph on American politics, S. Mandal (MIT); node classes: Liberal, Centrist, Conservative]

Girvan-Newman's quadratic formulation: the modularity of a network is the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random («Deviation to Independence»).

Maximizing modularity rigorously may be NP-hard: use heuristic approaches.

MIT heuristic algorithm:
1. Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector
2. Partition the network into two parts based on the signs of the elements of that eigenvector
3. Repeat for each part
4. If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it
5. When the entire graph consists of indivisible subgraphs, stop

Typical running time: O(N^2 log N) for a sparse graph
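The first bisection of the heuristic above can be sketched in a few lines of plain Python. The sketch assumes the usual modularity matrix B_ij = A_ij - k_i k_j / (2m) for an undirected graph with at least one edge, and uses a shifted power iteration as a simple stand-in for an eigen-solver; the recursive subdivision of each part is left out. All function names are illustrative.

```python
from math import sqrt


def modularity_matrix(adj):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    return [[adj[i][j] - deg[i] * deg[j] / two_m for j in range(n)] for i in range(n)]


def leading_eigenvector(mat, iters=300):
    """Power iteration on mat + shift * I; the shift (a Gershgorin bound) makes every
    eigenvalue non-negative, so the iteration targets the largest eigenvalue of mat."""
    n = len(mat)
    shift = max(sum(abs(v) for v in row) for row in mat)
    vec = [1.0 + 0.01 * i for i in range(n)]  # deterministic, non-uniform start
    for _ in range(iters):
        new = [shift * vec[i] + sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


def spectral_bisection(adj):
    """Split the nodes by the signs of the leading eigenvector of B.
    Returns (part_a, part_b, modularity), or None if the split does not increase modularity."""
    n = len(adj)
    b = modularity_matrix(adj)
    signs = [1 if v >= 0 else -1 for v in leading_eigenvector(b)]
    two_m = sum(sum(row) for row in adj)
    # Q = s^T B s / (4m) for a split encoded by s_i = +1 or -1.
    gain = sum(b[i][j] * signs[i] * signs[j] for i in range(n) for j in range(n)) / (2 * two_m)
    if gain <= 1e-12:
        return None  # indivisible subgraph
    part_a = [i for i in range(n) if signs[i] > 0]
    part_b = [i for i in range(n) if signs[i] < 0]
    return part_a, part_b, gain
```

On two triangles joined by a single edge, the split recovers the two triangles. The dense matrix products used here cost O(N^2) per iteration, so this is a readable sketch rather than a scalable implementation.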

By relational transform we turn the criterion into a linear function subject to linear constraints:

$X_{ij} - X_{ji} = 0 \quad \forall (i,j)$ (Symmetry)
$X_{ii} = 1 \quad \forall i$ (Reflexivity)
$X_{ij} + X_{jk} - X_{ik} \le 1 \quad \forall (i,j,k)$ (Transitivity)
$X_{ij} \in \{0,1\}$ (Binarity)

Idea: relying on the locally linear «Louvain» algorithm (Blondel-Guillaume) (Univ Louvain/UPMC LIP6), use the Linear Relational Form: O(N log N).

We can do more: using the genericity of Louvain's algorithm, we can use better linear criteria than Girvan-Newman's, based on Optimal Transport justifications, e.g. «deviation to Indetermination» (Patricia Conde-Céspedes).
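The relational encoding can be made concrete with a small sketch: a partition is represented by a binary matrix X with X[i][j] = 1 when i and j share a class, the four constraints are checked directly, and the Girvan-Newman modularity is evaluated as a function that is linear in X. The normalization by 2m and the function names are assumptions of this sketch, not taken from the slides.

```python
from itertools import product


def relation_matrix(labels):
    """X[i][j] = 1 if objects i and j carry the same class label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def is_equivalence_relation(x):
    """Check binarity, reflexivity, symmetry and transitivity of X."""
    idx = range(len(x))
    binary = all(x[i][j] in (0, 1) for i, j in product(idx, idx))
    reflexive = all(x[i][i] == 1 for i in idx)
    symmetric = all(x[i][j] - x[j][i] == 0 for i, j in product(idx, idx))
    transitive = all(x[i][j] + x[j][k] - x[i][k] <= 1 for i, j, k in product(idx, idx, idx))
    return binary and reflexive and symmetric and transitive


def relational_modularity(adj, x):
    """Modularity written as a linear function of X:
    Q(X) = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * X_ij."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    return sum(
        (adj[i][j] - deg[i] * deg[j] / two_m) * x[i][j]
        for i in range(n) for j in range(n)
    ) / two_m
```

Because the coefficients (A_ij - k_i k_j / 2m) do not depend on X, swapping in another linear criterion only means changing those coefficients; the constraint set stays the same.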

Big Analytics: Some Topics of Interest

Big Analytics for Cyber-Security:
- Components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, logs analytics, cyber intelligence)
- Attack detection from relational & content data, intelligent IDS and sandbox coupling, intelligent coupling with IS passive and dynamic mapping
- Big Data platform for logs analytics, visual analytics

Big Analytics for Smart Transport:
- Business Analytics Web portal for passenger behaviour and profile understanding, traffic anomaly detection
- New components and use cases focused on mobility
- Approach based on space-time queries, BI, early warning engine, Big Analytics and optimization techniques for Smart City
- Fraud detection

Big Analytics for National Security:
- Social Web Intelligence for National Security: cyber-infringement detection and investigation
- SNA: social mining, crisis management
- Maritime security: predictive analysis & anomaly detection
- E-border: Big Analytics on passenger logs

Big Analytics for maintenance:
- HUMS (Health & Usage Monitoring Systems): applications to vehicle, radar, weapon systems, transport

Big Analytics Innovation Trends at Medium-Range Horizon

- Coupling auto-encoder neural nets with predictive modeling for feature extraction
- Opening «Data Streaming Processing» (real time) to more sophisticated and powerful analytical tools: towards real-life CEP
- Coupling «Genetic Algorithms» with «Relational linear transforms»: linearization procedures
- In network analysis, addressing the complexity of dynamic graph modeling: dynamic modularization