Deciphering Big Data Analytics - A Review of Technology & Applications

Ramesh Mahadik, M.Tech. (Computer Sc. & Engg.), I.I.T. Mumbai
Director, MCA, Institute of Management and Computer Studies (MCA Institute), Mumbai, Maharashtra, India
ramesh.imcost@yahoo.com; rameshgm.iitb@gmail.com

ABSTRACT: There is great excitement surrounding big data analytics in the world today. Organizations now understand the importance and significance of data-driven decision making, which has created growing enthusiasm for the idea of big data. Data becomes big data when its volume, velocity, or variety exceeds the abilities of conventional IT systems to ingest, store, analyze, and process it. Several organizations have the IT systems and expertise to handle large quantities of structured data, but with the increasing volume and faster flows of data, they lack the ability to mine it and derive actionable intelligence in a timely way. Big data analytics addresses this need for evolved data processing and analytics, which can handle fast-growing, high-volume data of multiple types (structured, semi-structured and unstructured), generated at high speed. This paper explores the technology framework and application areas of big data analytics.

Keywords: big data, big data analytics, Hadoop, MapReduce, decision support systems, data mining, business intelligence

1. Introduction

We are flooded with data today. A wide spectrum of application areas collect data at a humongous scale. Decisions that previously were based on guesswork, or on crudely constructed models, are now based on the data itself. Big data analytics now drives nearly every aspect of our modern society, including retail, manufacturing, financial services, mobile services, social media and healthcare, to name a few. Clearly, big data means business opportunities, but at the same time it also poses major research challenges. According to McKinsey & Co., big data is the next frontier for innovation, competition and productivity.
The impact of big data offers not only huge potential for competition and growth for individual companies; the right use of big data can also increase productivity, innovation, and competitiveness for entire sectors and economies.

2. Research Methodology

An exhaustive study of various texts, research articles and materials pertaining to big data analytics was carried out, with the aim of understanding its technology framework and application areas.

INCON X 2015 36
3. What is Big Data?

Big data refers to rapidly growing, structured and unstructured datasets with sizes beyond the ability of conventional database tools to store, manage and analyze them. It is characterized primarily by the 3Vs: volume, variety and velocity. Veracity is considered the fourth characteristic.

Figure 1: What is big data? (The figure plots data size, from gigabytes (10^9 bytes) through terabytes (10^12), petabytes (10^15) and exabytes (10^18) to zettabytes (10^21), against the speed, accuracy and complexity of the resulting intelligence: small data sets suit traditional or advanced analytics, while big data requires big data analytics.)

- Volume: a large quantity of data, which may be enterprise-specific, general or public.
- Variety: a diverse set of data, created by social networking feeds, video, audio, email, sensor data, etc.
- Velocity: the speed of data inflow, as well as the rate at which this fast-moving data needs to be stored.

Veracity: unlike carefully governed internal data, most big data comes from sources outside our control and therefore suffers from significant correctness or accuracy problems. Veracity represents both the credibility of the data source and the suitability of the data for the target audience.

4. Understanding Big Data Analytics

Fundamentally, big data analytics is the process of examining large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Let us try to understand what big data analytics is and what it isn't.

4.1 What Big Data Analytics Is:
i) A technology-enabled strategy for gaining richer, deeper insights into customers, partners, and the business, and ultimately gaining competitive advantage.
ii) Working with data sets whose size and variety are beyond the ability of conventional database software to capture, store, manage, and analyze.
iii) Processing a steady stream of real-time data in order to make time-sensitive decisions faster than ever before.
iv) Distributed in nature: analytics processing goes to where the data is, for greater speed and efficiency.
v) A new paradigm in which IT collaborates with business users and data scientists to identify and implement analytics that will increase operational efficiency and solve new business problems.

4.2 What Big Data Analytics Isn't:
i) Just about technology. At the business level, it is about how to exploit the vastly enhanced sources of data to gain insight.
ii) Only about volume. It is also about variety and velocity, but perhaps most important, it is about the value derived from the data.
iii) Generated or used only by huge online companies like Google or Amazon anymore. While internet companies may have pioneered the use of big data at web scale, its applications now touch every industry.
iv) About one-size-fits-all traditional relational databases built on a shared disk and memory architecture. Big data analytics uses a grid of computing resources for massively parallel processing (MPP).
v) Meant to replace relational databases or the data warehouse. Structured data continues to be critically important to companies. However, traditional systems may not be suitable for the new sources and contexts of big data.

5. Architecture for Big Data

In the world of big data, the data volumes we need to work with on a day-to-day basis have outgrown the storage and processing capabilities of a single host. Big data brings with it two fundamental challenges: (i) how to store and process voluminous data sizes, and, more important, (ii) how to understand the data and turn it into a competitive advantage.

Apache Hadoop fills this gap in the market by effectively storing and providing computational capabilities over substantial amounts of data. It is a distributed system made up of a distributed filesystem, and it offers a way to parallelize and execute programs on a cluster of machines. It has already been adopted by technology giants like Yahoo!, Facebook, and Twitter to address their big data needs.
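Hadoop's answer to challenge (i) rests on a simple idea: chop a large file into fixed-size blocks and keep several copies of each block on different machines. The toy Python sketch below illustrates only that idea; the tiny block size, the node names and the round-robin placement are this sketch's own assumptions, not HDFS's actual (rack-aware) placement policy.

```python
# Toy sketch of distributed block storage: split a file's bytes into
# fixed-size blocks and place each block on several "nodes" (replication).
# Block size and placement policy here are illustrative only.

BLOCK_SIZE = 4          # HDFS uses large blocks (tens of MB); tiny here for demo
REPLICATION = 3         # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.
    Returns {block_index: [node, ...]} - a toy 'namenode' table."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"hello big data world!"
blocks = split_into_blocks(data)
nodes = ["node1", "node2", "node3", "node4"]
table = place_blocks(blocks, nodes)

print(len(blocks))    # number of blocks the "file" was split into
print(table[0])       # the nodes holding replicas of block 0
```

Because every block lives on several nodes, losing one machine loses no data; a real namenode would notice the failure and re-replicate the affected blocks elsewhere, as described in section 5.1.1 below.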
From an architectural perspective, Hadoop, as shown in Figure 2, is a distributed master-slave architecture that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities.
Figure 2: High-level Hadoop architecture

5.1 Core Hadoop Components

To understand Hadoop's architecture, we will start by looking at the basics of HDFS and MapReduce.

5.1.1 HDFS (Hadoop Distributed File System)

HDFS is the storage component of Hadoop. It is a distributed filesystem modeled after the Google File System (GFS) paper. HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS leverages unusually large (for a filesystem) block sizes and data-locality optimizations to reduce network input/output (I/O). Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance. HDFS replicates files a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates data blocks from nodes that have failed. Figure 3 shows a logical representation of the components in HDFS.

Figure 3: HDFS architecture

5.1.2 MapReduce

MapReduce is a batch-based, distributed computing framework modeled after Google's paper on MapReduce. It allows you to parallelize work over a large amount of raw data, such as combining web logs with relational data from an OLTP database to model how users interact with your website. This type of work, which could take days or longer using conventional serial programming techniques, can be reduced to minutes using MapReduce on a Hadoop cluster. It is a convenient way to partition tasks across many machines, and it can handle machine failure. It works across different application types, such as search and ads, and supports the pre-computation of useful data: computing word counts, sorting terabytes of data, and so on.

MapReduce decomposes work submitted by a client into small, parallelized map and reduce tasks. The role of the programmer is to define map and reduce functions, where the map function outputs key/value tuples, which are processed by reduce functions to produce the final output. The power of MapReduce lies between the map output and the reduce input, in the shuffle and sort phases. Hadoop's MapReduce architecture is similar to the master-slave model in HDFS. The main components of MapReduce are illustrated in its logical architecture, shown in Figure 4.

Figure 4: MapReduce logical architecture

5.1.3 The Hadoop Ecosystem
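As a concrete illustration of the map/shuffle/reduce flow described in section 5.1.2, the classic word-count job can be sketched in plain Python, with no Hadoop cluster involved. The function names and the in-memory "shuffle" are this sketch's own simplifications; on a real cluster the framework distributes each phase across machines.

```python
from collections import defaultdict

# Word count in the MapReduce style: the programmer supplies map and
# reduce functions; the framework handles the shuffle and sort phases.

def map_fn(line: str):
    """Map: emit a (word, 1) tuple for every word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(mapped):
    """Shuffle/sort: group all values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return sorted(groups.items())   # sorted by key, as in Hadoop's sort phase

def reduce_fn(key, values):
    """Reduce: collapse all counts for one word into a total."""
    return (key, sum(values))

lines = ["big data is big", "data flows fast"]
mapped = [kv for line in lines for kv in map_fn(line)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped))
print(result)   # {'big': 2, 'data': 2, 'fast': 1, 'flows': 1, 'is': 1}
```

In real Hadoop the same two roles are filled by Mapper and Reducer implementations (typically in Java, or as streaming scripts), and it is the distributed shuffle between them that lets the job scale across the cluster.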
Figure 5: The Hadoop ecosystem

The following are the components of the Hadoop ecosystem:
- HDFS: a distributed, fault-tolerant file system.
- MapReduce: a framework for writing and executing distributed, fault-tolerant algorithms.
- Avro: a data serialization system.
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig: a high-level data-flow language and execution framework for parallel computation.
- Sqoop: a package for moving data between HDFS and relational database systems.
- HBase: a scalable, distributed database that supports structured data storage for large tables.
- ZooKeeper: a high-performance coordination service for distributed applications.

6. An Architecture for Big Data Analytics

The big data analytics architecture described below utilizes the massively parallel, distributed storage and processing framework provided by Hadoop HDFS and MapReduce.

Figure 6: An architecture for big data analytics

Structured data is captured through various data sources, including OLTP systems, legacy systems and external systems. It goes through the ETL process from the source systems to the target data warehouse. Traditional business intelligence (BI) batch analytical processing tools, such as online analytical processing (OLAP), data mining, and query and reporting, can then be used to create the business intelligence that enhances business operations and decision processes.

Unstructured and semi-structured big data sources can be of a wide variety, including data from social media, mobile devices, sensors, documents and reports, web logs, call records, scientific research, satellites, and geospatial devices. They are loaded into the Hadoop Distributed File System cluster. Hadoop MapReduce provides the fault-tolerant distributed processing framework across the Hadoop cluster, where batch analytics can be performed. Actionable insight resulting from Hadoop MapReduce analytics and business intelligence analytics can be consumed by operational and analytical applications.

Geospatial intelligence is described as using data about space and time to improve the quality of predictive analysis. For example, real-time recommendations of places of interest can be based on the user's real-time location from smartphone usage. This real-time information can be combined with batch analytics to improve the quality of the predictions. Real-time NoSQL databases such as HBase can be used in conjunction with Hadoop to provide real-time read/write access to Hadoop data. Real-time insight created by real-time analytics can be consumed by real-time operations and decision processes.

7. Applications of Big Data Analytics

Eventually, every aspect of our lives will be affected by big data analytics. The following are some areas where big data analytics is already making a real difference today, with widespread use as well as the highest benefits.

7.1 Understanding and Targeting Customers

This is one of the biggest and most publicized areas of big data use today. Here, big data is used to better understand customers and their behaviors and preferences.
Companies are keen to expand their traditional data sets with social media data, browser logs, text analytics and sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Some examples: using big data, telecom companies can now better predict customer churn; Wal-Mart can predict what products will sell; and car insurance companies understand how well their customers actually drive.

7.2 Understanding and Optimizing Business Processes

Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts. One particular business process that is seeing a lot of big data analytics is supply chain or delivery route optimization. Here, geographic positioning and radio frequency identification sensors are used to track goods or delivery vehicles, and routes are optimized by integrating live traffic data, among other sources.

7.3 Personal Quantification and Performance Optimization
Big data is not just for companies and governments but also for all of us individually. We can now benefit from the data generated by wearable devices such as smart watches or smart bracelets. Take the Up band from Jawbone as an example: the armband collects data on our calorie consumption, activity levels, and sleep patterns. While it gives individuals rich insights, the real value is in analyzing the collective data. In Jawbone's case, the company now collects 60 years' worth of sleep data every night. Analyzing such volumes of data will bring entirely new insights that it can feed back to individual users.

7.4 Improving Healthcare and Public Health

The computing power of big data analytics enables us to decode entire DNA strings in minutes, and will allow us to find new cures and to better understand and predict disease patterns. Big data techniques are already being used to monitor babies in a specialist premature and sick baby unit. By recording and analyzing every heartbeat and breathing pattern of every baby, the unit was able to develop algorithms that can now predict infections 24 hours before any physical symptoms appear. That way, the team can intervene early and save fragile babies in an environment where every hour counts. Integrating data from medical records with social media analytics enables us to monitor flu outbreaks in real time, simply by listening to what people are saying, e.g. "Feeling rubbish today - in bed with a cold."

7.5 Improving Sports Performance

Most elite sports have now embraced big data analytics. The IBM SlamTracker tool is used for tennis tournaments; video analytics track the performance of every player in a football or baseball game; and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smartphones and cloud servers) on our game and how to improve it.
Many elite sports teams also track athletes outside of the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing.

7.6 Improving Science and Research

Science and research is currently being transformed by the new possibilities big data brings. Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe - how it started and how it works - generate huge amounts of data. The CERN data center has 65,000 processors to analyze its 30 petabytes of data. However, it also uses the computing power of thousands of computers distributed across 150 data centers worldwide to analyze the data.

7.7 Optimizing Machine and Device Performance
Big data analytics helps machines and devices become smarter and more autonomous. For example, big data tools are used to operate Google's self-driving car. The Toyota Prius is fitted with cameras, GPS, and powerful computers and sensors to drive safely on the road without human intervention.

7.8 Improving Security and Law Enforcement

Big data is applied heavily in improving security and enabling law enforcement. The National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots. Others use big data techniques to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data to detect fraudulent transactions.

7.9 Improving and Optimizing Cities and Countries

Big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data. A bus might wait for a delayed train, and traffic signals might predict traffic volumes and operate to minimize jams.

7.10 Financial Trading

The final category of big data application comes from financial trading. High-frequency trading (HFT) is an area where big data finds a lot of use today. Here, big data algorithms are used to make trading decisions. Today, the majority of equity trading takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.

8. Conclusion

We have already entered the era of big data analytics. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. This paper explains the characteristics of big data and reviews an architecture for big data analytics.
It also presents some of the most widespread application areas of big data analytics. However, we need to address many technical challenges before the potential of big data analytics can be fully realized. These challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, security, privacy, timeliness and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation.

9. References

1. Joseph O. Chan, Roosevelt University, USA: An Architecture for Big Data Analytics, Communications of the IIMA, 2013, Issue 2
2. Rob Peglar: Introduction to Analytics & Big Data - Hadoop, Storage Networking Industry Association, 2012
3. Apache Software Foundation (2013a): Welcome to Apache Hadoop, retrieved from http://hadoop.apache.org/
4. Apache Software Foundation (2013b): Welcome to Apache HBase, retrieved from http://hbase.apache.org/
5. Apache Software Foundation (2013c): Architecture Overview - What is the difference between HBase and Hadoop/HDFS?, retrieved from http://hbase.apache.org/book/architecture.html#arch.overview
6. NASSCOM, New Delhi: Big Data - The Next Big Thing, 2012, www.nasscom.in
7. Community white paper by US researchers (Philip Bernstein - Microsoft, Elisa Bertino - Purdue Univ., Umeshwar Dayal - HP, Michael Franklin - UC Berkeley, Johannes Gehrke - Cornell Univ.): Challenges & Opportunities with Big Data, 2012
8. NESSI White Paper: Big Data - A New World of Opportunities, 2012
9. Oracle White Paper: Big Data Analytics - Advanced Analytics in Oracle Database, 2013
10. Intel IT Center: Getting Started with Big Data - Steps IT Managers Can Take to Move Forward with Apache Hadoop Software, 2013, www.intel.com
11. Alex Holmes: Hadoop in Practice, 2012, www.it-ebooks.info
12. Bernard Marr, keynote speaker and consultant in strategy, performance management, analytics, KPIs and big data: The Awesome Ways Big Data Is Used Today to Change Our World, November 2013, www.linkedin.com

*****