Hadoop Beyond Hype: Complex Adaptive Systems Conference, Nov 16, 2012
Viswa Sharma, Solutions Architect, Tata Consultancy Services
Agenda
- What is Hadoop
- Why Hadoop? The Net Generation is here
- Sizing the Hadoop
- Gartner Hadoop Hype Cycle
- TCS view point
- Hadoop Eco System Landscape
- Examples of uses of Hadoop
- Transformational Platform
- Ad Hoc Analysis
- Analytics with Hadoop
- Applications of Hadoop Analytics
- Near Real Time Analysis
- What is the market
- Thank You
TCS Confidential
What is Hadoop?
Hadoop is the name of a toy elephant, given to:
- A scale-out computing platform that processes Internet-size data
- A parallel file system modeled after the Google File System
- A parallel programming environment modeled after Google Map/Reduce
- Open source software
- Commodity hardware
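The Map/Reduce programming environment named on this slide can be illustrated with a minimal, single-process sketch. It does not use Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for illustration. On a real cluster the framework would run the map and reduce steps in parallel across nodes and perform the shuffle between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the slides.
    sample = ["big data big cluster", "data on commodity hardware"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread over many machines, which is the property that lets the model scale out.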
Why Hadoop? The Net Generation is here
The Net Generation is interconnected on a variety of Web-based and digital channels.
Big Data: Web Scale
- 50 billion web pages
- 800 million Facebook users
- 1000 million Facebook pages
- 200 million Twitter accounts
- 100 million tweets per day
- 5 billion Google queries per day
- Millions of servers, petabytes of data
Varieties of Data
- Video / Audio
- Images / Pictures
- Diverse internal and external data
Sources of Data
- News / Feeds / Blogs / Forums
- Groups / Polls / Chats / Wiki
Information is exploding all around, but the challenge is to understand it.
Sizing the Hadoop
Source: Pawyi Lee
Hadoop Hype Cycle Starts
Gartner Hype Cycle 2012
TCS View Point: Hadoop Technology is here now
Big Data technology handles data at extreme scale and is characterized by:
- Massively parallel computing to divide and conquer workloads
- Extreme flexibility, allowing unlimited data manipulation and transformation
- Massive scalability in terms of both technology and cost
Hadoop: massively parallel processing capability, running on commodity hardware
- HBase and Hadoop/HDFS are designed to store and manage massive amounts of data
- Hive, Mahout, and R enable query, analysis, and running in-memory, compute-intensive applications
The ecosystem of Hadoop technology is affordable and within the reach of companies.
Hadoop Eco System Landscape
- Analytics / Visualization
- Search
- NoSQL
- Query Oriented Data Warehouse
- Data Integration
- CEP
- Languages / Libraries
- Tools
- Hadoop Distributions
- Appliance / MR Rewrite
- Cloud Distributions
- Map Reduce
- Distributed File System
Examples of Uses of Hadoop
Hi Tech
- Process control for microchip fabrication
- Network management
- Supply chain management and analysis
- New product development
- Content management solutions
Travel, Transportation & Hospitality
- Better travel searches
- Geo-fencing
- Cross-selling and up-selling
- Intelligent traffic management
Energy, Resources & Utilities
- Weather impact analysis on power generation
- Oil rig data monitoring
- Smart meter data analysis
- Terrain data analysis for wind energy
- Smart grids
Insurance
- Claims analysis & premium forecasting
- Claims fraud detection & revenue comparison
- Overall risk analysis & reinsurance risk assessment
- Policy pricing & customer retention
Government
- Fraud detection and cyber security
- Compliance and regulatory analysis
- Energy consumption and carbon footprint management
- Disaster management
Hadoop as Transformation Platform in ETL
Diagram: Transactional Systems, Hadoop Cluster (MapReduce / Hive / Pig over HDFS), Data Warehouse
- Within the Hadoop ecosystem, MapReduce, Hive, or Pig can be used to transform data inside the distributed file system (HDFS).
- Hadoop cluster: fewer, higher-end nodes.
- Tools like Sqoop can be used to load data into and out of HDFS.
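As a rough illustration of the transform step, the sketch below cleanses raw transaction rows and aggregates them per customer, written in the mapper/reducer style that Hadoop Streaming jobs use. It is a sketch under stated assumptions: the three-column CSV layout (`txn_id,customer_id,amount`), the function names, and the sample rows are hypothetical, and the code runs in a single process rather than against HDFS.

```python
from itertools import groupby


def mapper(lines):
    """Parse raw CSV rows and emit (customer_id, amount) pairs.

    Rows with the wrong column count or a non-numeric amount are skipped,
    which stands in for the cleansing part of a transform step.
    """
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue
        _txn_id, customer_id, amount = fields
        try:
            yield customer_id, float(amount)
        except ValueError:
            continue


def reducer(sorted_pairs):
    """Sum amounts per customer.

    The input must be sorted by key; in Hadoop the shuffle/sort step
    between map and reduce provides that ordering.
    """
    for customer_id, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield customer_id, sum(amount for _, amount in group)


def run_job(lines):
    """Local stand-in for a job run: map, sort by key, then reduce."""
    mapped = sorted(mapper(lines), key=lambda kv: kv[0])
    return list(reducer(mapped))


if __name__ == "__main__":
    # Hypothetical extract from a transactional system.
    raw = ["t1,c1,10.0", "t2,c2,5.5", "bad row", "t3,c1,2.5", "t4,c2,abc"]
    for customer_id, total in run_job(raw):
        print(f"{customer_id}\t{total}")
```

The per-customer totals are the kind of summarized output that would then be loaded into the data warehouse, while the raw rows stay in HDFS.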
Hadoop as an ad-hoc analysis platform
Diagram: Transactional Systems, Data Warehouse, Hadoop Cluster (MapReduce / Hive / Pig over HDFS)
- MapReduce, Hive, or Pig can be used to transform data inside the distributed file system (HDFS). This gives the business analytics team a platform for innovation.
- HDFS holds data at the lowest grain.
- Hadoop cluster: a higher number of nodes for larger storage.
- Tools like Sqoop can be used to load data into and out of HDFS.
Analytics With Hadoop
Prescriptive (What should happen?): optimizing outcomes
- Optimization
- Simulation
Predictive (What will happen?): identifying possible outcomes
- Text Analytics
- Data Mining
- Predictive Modeling
- Statistical Analysis
- Visual Analytics
- Forecasting
Descriptive (What has happened?): describing and analyzing outcomes
- Analysis, Drill Down, Ad Hoc Reporting
- Dashboards and Scorecards
- Visual Analytics
Other labels on the slide: Domain Expertise, Knowledge
Applications for Hadoop Analytics
- Smarter Healthcare
- Multi-channel sales
- Finance
- Log Analysis
- Homeland Security
- Traffic Control
- Telecom
- Search Quality
- Manufacturing
- Trading Analytics
- Fraud and Risk
- Retail: Churn, NBO
Hadoop Near Real Time Analytics
Inputs: External Inputs (incl. social media), Transactional Systems
Complex Event Processing (online, rule application)
- Rule / pattern matching on streams.
- Distributed processing: processing is distributed on a set of nodes, but not the data.
- Examples: Fraud Detection, Online Price Management, Yield Management
Batch Map-Reduce Processing (offline, rule discovery)
- Rule / pattern discovery [on time series].
- Distributed processing: Map-Reduce or scalable time-series pattern mining.
- [Time series] mining and rule discovery
- Examples: Learn Fraud Patterns, Demand Signal Refinement
Distributed Stream Processing [using MR] (real-time self-learning systems)
- Rule / pattern discovery on streams.
- Complex / dynamic pattern matching, e.g. trading patterns, mining current influencers
- Distributed processing: both processing and data are distributed on a set of nodes, e.g. C-MR (academic project)
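The split between offline rule discovery and online rule application can be sketched in a few lines. This is an illustration only, not the architecture on the slide: the rule form (a mean-plus-k-standard-deviations amount threshold), the sliding-window trigger, the function names, and all sample values are assumptions, and everything runs in one process instead of on a batch cluster and a CEP engine.

```python
from collections import deque
from statistics import mean, pstdev


def discover_rule(history, k=3.0):
    """Offline step: learn an amount threshold from historical data.

    The rule is mean + k * population standard deviation; in a real
    deployment this would be computed by a batch job over the full history.
    """
    return mean(history) + k * pstdev(history)


def apply_rule(stream, threshold, window=3, max_hits=2):
    """Online step: match the learned rule against a stream of events.

    Yields the id of each event at which at least `max_hits` of the last
    `window` events exceeded the threshold.
    """
    recent = deque(maxlen=window)  # sliding window of True/False hits
    for event_id, amount in stream:
        recent.append(amount > threshold)
        if sum(recent) >= max_hits:
            yield event_id


if __name__ == "__main__":
    # Hypothetical historical amounts and live events.
    history = [10, 12, 11, 9, 13, 10, 11, 12]
    threshold = discover_rule(history)
    live = [("e1", 12), ("e2", 40), ("e3", 11), ("e4", 55), ("e5", 10), ("e6", 9)]
    print(list(apply_rule(live, threshold)))
```

Keeping the two steps separate means the threshold can be re-learned on a schedule by the batch side and swapped into the stream matcher without changing the matching code.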
What is the Market?
Thank You
5 December, 2012