In Organizations Mark Vervuurt Cluster Data Science & Analytics
AGENDA
1. Yellow Elephant
2. Data Ingestion & Complex Event Processing
3. SQL on Hadoop
4. NoSQL
5. InMemory
6. Data Science & Machine Learning
7. Industrialization
Doug Cutting
2006: Hadoop Ecosystem
- MapReduce: Parallel Data Batch Processing Framework
- Hadoop Distributed File System (HDFS): Store Data Redundantly
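The MapReduce model named on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce phases, not Hadoop's actual API; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; on a cluster each line could come from a different HDFS block.
    sample = ["hadoop stores data", "hadoop processes data in batches"]
    print(word_count(sample))
```

On a real cluster the map and reduce calls would run in parallel on many machines, each close to the HDFS blocks holding its input; the sketch only shows the shape of the computation.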
Hadoop Business Benefits
- Meet ETL Service Level Agreements (SLAs)
- Store Structured and Unstructured Data in One Place
- Storage and Batch Processing of Large Data Sets (Petabyte Scale)
- Cost-Effective Storage and Processing using Low-End or Commodity Servers
Data Ingestion & Complex Event Processing
Import and Export Relational Data
- Import data from relational databases into Hadoop
- Export data from Hadoop into relational databases
Diagram: Database <-> Sqoop <-> Hadoop
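As a rough sketch of what the import and export steps look like in practice, the Python helpers below assemble argument lists for the `sqoop import` and `sqoop export` commands. The JDBC URL, user name, table and HDFS paths are hypothetical placeholders, and the helper function names are assumptions made for this example; only the Sqoop flags themselves are real.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=4):
    """Build the argument list for copying one relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


def sqoop_export_command(jdbc_url, username, table, export_dir):
    """Build the argument list for copying files from HDFS back into a relational table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    cmd = sqoop_import_command(
        "jdbc:mysql://db.example.com/sales", "etl_user", "orders", "/data/raw/orders"
    )
    print(shlex.join(cmd))
```

On a machine with Sqoop installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`; credentials handling is deliberately left out of the sketch.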
Stream Data
- Stream Log Files into Hadoop for Storage, Processing and Analysis
- High-Throughput Distributed Messaging Queue
Complex Event Processing (CEP)
- Filter, Transform and Process Events
- Filter, Transform and Process Micro-Batches of Events
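The difference between processing single events and processing micro-batches can be illustrated with a small sketch. It is plain Python and not tied to any particular streaming engine; the event shape (a timestamp plus a sensor reading), the batch interval and the alert threshold are all assumptions made for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive fixed-width time windows."""
    batches = {}
    for timestamp, payload in events:
        window = int(timestamp // batch_interval)
        batches.setdefault(window, []).append(payload)
    # Return the non-empty windows in time order.
    return [batches[window] for window in sorted(batches)]


def process_batch(batch, threshold):
    """Filter readings above the threshold, then transform them into alert records."""
    return [
        {"sensor": reading["sensor"], "alert": reading["value"]}
        for reading in batch
        if reading["value"] > threshold
    ]


if __name__ == "__main__":
    # Hypothetical sensor events: (seconds since start, reading).
    events = [
        (0.2, {"sensor": "a", "value": 10}),
        (0.9, {"sensor": "b", "value": 80}),
        (1.1, {"sensor": "a", "value": 95}),
        (2.5, {"sensor": "b", "value": 20}),
    ]
    for batch in micro_batches(events, batch_interval=1.0):
        print(process_batch(batch, threshold=50))
```

An event-at-a-time engine would call the filter and transform step once per event; a micro-batch engine collects events for a short interval and processes each small batch as a unit, trading a little latency for throughput.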
Hadoop Business Benefits
- Data Warehouse Optimization
- (Near) Real-Time Recommendation Engines
- Internet of Things Enabler Through Streaming, Streaming Analytics and Storage of Large Data Sets
SQL on Hadoop
Hive
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Stack (top to bottom):
- Hive: SQL on Hadoop (Data Warehouse)
- MapReduce: Parallel Data Batch Processing Framework
- Hadoop Distributed File System (HDFS): Store Data Redundantly
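The idea of projecting structure onto data that already sits in storage (often called schema-on-read) can be sketched without Hive itself. The Python below is only an analogy: the raw text, the column names and the helper functions are assumptions for illustration, and the aggregation stands in for what a `GROUP BY` query would express in HiveQL.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw file content as it might sit in distributed storage: no header, no types.
RAW = "2015-01-03,NL,19.99\n2015-01-03,DE,5.00\n2015-01-04,NL,42.50\n"

# The "table definition": column names and types, applied only when the data is read.
SCHEMA = [("order_date", str), ("country", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Project the schema onto raw delimited text, yielding one typed dict per row."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def total_by(rows, key, measure):
    """Roughly what SELECT key, SUM(measure) ... GROUP BY key would compute."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    print(total_by(read_with_schema(RAW, SCHEMA), "country", "amount"))
```

The stored bytes never change; only the schema applied at read time decides how they are interpreted, which is why the same files can be queried under different table definitions.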
Impala
HAWQ
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Stack (top to bottom):
- Pig: ETL (Data Warehouse)
- MapReduce: Parallel Data Batch Processing Framework
- Hadoop Distributed File System (HDFS): Store Data Redundantly
Hadoop Business Benefits
- Query and Explore Large Data Sets (Petabyte Scale)
- Query and Transform Large Data Sets
NoSQL
NoSQL Databases
- NoSQL: Not Only SQL
- Big Data Storage and Querying
- Non-Relational
- Data Model follows Query
- Horizontally Scalable
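"Data model follows query" means the tables are designed around the questions the application will ask, rather than around a normalized schema. The toy store below is a minimal in-memory sketch of that idea; the class, the two example queries and the record fields are assumptions for illustration and do not correspond to any specific database API.

```python
from collections import defaultdict


class QueryFirstStore:
    """Toy store that keeps one denormalized table per query it must answer."""

    def __init__(self):
        self.orders_by_customer = defaultdict(list)  # partition key: customer_id
        self.orders_by_day = defaultdict(list)       # partition key: order_date

    def insert_order(self, order):
        # Write the same record to every query table: duplication instead of joins.
        self.orders_by_customer[order["customer_id"]].append(order)
        self.orders_by_day[order["order_date"]].append(order)

    def get_orders_for_customer(self, customer_id):
        """Answer 'all orders of one customer' with a single key lookup."""
        return list(self.orders_by_customer.get(customer_id, []))

    def get_orders_for_day(self, order_date):
        """Answer 'all orders on one day' with a single key lookup."""
        return list(self.orders_by_day.get(order_date, []))


if __name__ == "__main__":
    store = QueryFirstStore()
    store.insert_order({"customer_id": "c1", "order_date": "2015-03-01", "total": 30.0})
    store.insert_order({"customer_id": "c2", "order_date": "2015-03-01", "total": 12.5})
    store.insert_order({"customer_id": "c1", "order_date": "2015-03-02", "total": 99.0})
    print(store.get_orders_for_customer("c1"))
    print(store.get_orders_for_day("2015-03-01"))
```

Because every query is served by one partition key lookup, the partitions can be spread across many servers, which is what makes this style of modeling fit horizontal scaling; the cost is extra storage and more work on every write.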
Cassandra NoSQL Database
- DataStax Enterprise Community Edition
- Optimized for High Availability
- Optimized for High-Throughput Writes
- Geographical Data Replication
- Integrates with Hadoop Ecosystem
- Cassandra Query Language
- Spark API for Cassandra
Hadoop Business Benefits
- Near Real-Time Query Response on Hadoop
- Store and Query Large Data Sets (Petabyte Scale) in a Database Cost-Effectively on Low-End or Commodity Servers
- Inserts, Deletes and Updates on Hadoop
InMemory
Spark
Pivotal GemFire XD
Hadoop Business Benefits
- Build Streaming Applications
- Build Online Analytical Applications
- Real-Time and Fastest Query Response on Hadoop
Data Science & Machine Learning
Apache Mahout
Spark MLlib & GraphX
Hadoop Business Benefits
- Business Forecasting
- Preventive Maintenance
- Profiling & Anomalous Behavior
- Build Recommendation Engines
- Segment Customers Automatically
Industrialization
Deployment Modes
- Bare-Metal
- Virtualized
- Cloud
- As a Service
Hue
SAS
Informatica PowerCenter Big Data Governance
Hadoop Business Benefits
- Easy and Enterprise-Ready Hadoop
TRENDS
- Datafication of the Enterprise
- Multidisciplinary Teams
- Data Scientists
- Data Engineers
Questions & Answers
About Capgemini
With almost 145,000 people in over 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2014 global revenues of EUR 10.573 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience, and draws on Rightshore, its worldwide delivery model. Learn more about us at www.capgemini.com.
The information contained in this presentation is proprietary. Copyright 2015 Capgemini. All rights reserved. Rightshore is a trademark belonging to Capgemini.