W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Size: px
Start display at page:

Download "W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract"

Transcription

1 W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the approaches and solutions to manage structure and apply analytics on this voluminous data to draw valuable insights and business intelligence. The paper also elaborates on using the Hadoop eco system for deriving structure from this large scale data. Impetus Technologies, Inc. October 2010

2 Table of Contents Introduction... 3 Challenges facing large data management... 5 Finding solutions to large data challenges... 7 BI implementation approaches over large data... 9 Approach 1: Analytics over Hadoop with MPP DWs... 9 Approach 2: Indirect Analytics over Hadoop... 9 Approach 3: Direct Analytics over Hadoop Case Study Summary

3 Introduction Large Data can be described as data that occupies very large storage space in the file system, which in the present context can range from 1 terabyte to petabytes. The world is currently witnessing an upsurge in digital data, and organizations require solutions that can help them extract valuable intelligence and insights from this data. In this scenario, the role of Business Intelligence and Analytics in drawing insights from large scale data has increased exponentially. Take the instance of Financial Data Monitoring systems. We all know that billions/trillions of financial transactions take place every single hour. Now for an organization that provides a fraud detection system for financial transactions, it would be nearly impossible to detect suspicious transactions over the huge amount of data generated. Figure 1 Deriving Intelligence from Unstructured Data For such enterprises, it is imperative to find the best solutions to: Store Data Cleanse and Transform Data Apply Analytics over this Set 3

4 The data has to be collected over a long period of time, maybe months, or even years, from various distributed geographical locations. They require an excellent and cost effective framework for processing this data in a distributed manner on commodity hardware. The data will have to be processed and summarized to its farthest limit, and then pursued and processed time and again. The various mechanisms, such as generating alerts on any dubious transactions would be required by the system. This is where BI and analytics come into the picture. Organizations today are also driving their businesses based on the feeling and emotions of their customers and potential customers. While they have huge amounts of unstructured data available in the form of tweets, comments, posting, blogs etc., it needs to be properly mined and analyzed to gauge the sentiments of the customers and identify their compliments, comments and grievances. Based on this feedback and inputs, organizations can decide the course of their business roadmaps. They can also carry on predictive analysis by identifying the patterns in customer behavior as well as their preferences. Clearly, every business domain can use the data that it already has, or wants to gather, in myriad useful ways and use it to create more business opportunities and ideas. 4

5 Challenges facing large data management Large data is one of the biggest challenges facing organizations today, and applying analytics is an even more difficult job. Internet applications are showing nearly 100% growths every year and enterprise applications are also gathering momentum. Thus, addressing the concerns related to large data has become the top priority. Figure 2 Analytics Challenges The main challenges associated with Large Enterprise Data Warehouses are as follows: Disparate Sources: Data comes from many different sources and therefore, companies need to identify these data sources and the issues related to them. They have to perform cleansing operations to obtain the sensible data and mechanisms to draw conclusions from the data. Identification of Useful Data: The Extraction Transformation and Loading (ETL) process is very resource intensive. Companies need to throw away a lot of garbage data, correct the missing information and find replacements for certain fields that may be more useful for internal processing. 5

6 Structure Data: A structure has to be derived out of the raw data, so that analytics can happen. Store Data: One of the biggest challenges with Large Data is storing it. For deriving any sort of intelligence from data of this magnitude, effective mechanisms are required to store it. Use of RDBMS: A lot of traditional data warehouses are built around RDBMS. With the data warehouses growing to terabytes and beyond, conventional relational database management systems are not the best option when it comes to managing the large data sets. Minimizing Response Time: The time needed to perform analytics is very high. This time needs to be reduced for companies to derive results in the form of conclusions. In order to overcome these challenges, organizations must choose the appropriate Business Intelligence strategy, based on factors such as ease and cost of implementation. Since, gathering the right information from the data can be an expensive proposition; companies must select the right BI solution for their business needs, to lower impact on their IT budgets. They must also check to see how easily the solution can be implemented and start giving returns. At the same time, organizations must identify whether the data that is being gathered and stored needs to be analyzed in real time, using high touch queries or is whether the same can be processed in batches, using a framework like Hadoop to produce results in an offline manner. Companies will additionally need to gauge the best strategy to store their large data based on its size and usage. The data, which is available in different formats like Structured or Unstructured forms, might intersect at some level and need to be combined to provide more useful business insights. The other challenges that one has to consider are complex computations, data security, scalability, and accessibility and fault tolerance. Based on the business needs, companies have to weigh each of these challenges and identify the BI strategy that best suits them. Properly utilized, large data can emerge as a differentiator for companies. Organizations can use it to gain an edge over competition. By effectively managing as well as storing terabytes to petabytes of data, performing rich analysis on these massive data sets, and doing it all at ultrafast speeds, organizations can transform their voluminous data into a business asset. 6

7 Finding solutions to large data challenges Although, there are various solutions available for storing large amounts of data, Hadoop is one of the best options available. Hadoop is a flexible infrastructure for large scale computation and data processing on a network of commodity hardware. It is a common infrastructure pattern extracted from building distributed systems. Hadoop takes a large piece of data, breaks it up into smaller pieces and distributes it into various nodes of the cluster. These nodes then execute the pieces in parallel and independently, feeding back into each other. They follow the programming paradigm of MapReduce, a patented software framework introduced by Google to support distributed computing on large data sets on computer clusters. MapReduce is significant because it allows developers to create a large variety of parallel programs without having to worry about programming for intracluster communication, task monitoring or task failure handling. Earlier, creating programs that handled these issues could consume literally months of programming time. MapReduce programs have been created for everything from text tokenization, indexing, search to data mining and machine learning algorithms. The best part about Hadoop is that it is an Open Source project initiated by Apache. It was created taking inspiration from Google s MapReduce and GFS papers. Today, Yahoo is one of the largest contributors to the evolution of Hadoop and also responsible for getting Hadoop to its current state. It is used extensively by businesses these days because of its various advantages like: Ability to linearly scale up to thousands of nodes Ability to use commodity hardware, leading to a huge cost advantage Flexibility, which gives Hadoop its real power and allows it to implement map and reduce processes to solve particular problems It s simple programming and execution environment, that accommodates for failures Its DFS for storing data and the Map Reduce execution paradigm The availability of tools built around Hadoop such as Hive and Pig that simplify writing MapReduce jobs Hive for instance, is a database or a data warehouse infrastructure that functions on top of Hadoop. It tries to remove most of the complications with Hadoop and provides programmers with an easy interface to Hadoop. Hive is an effective tool for enabling easy ETL, generating summarizations, and putting 7

8 structures on the data. It also has the capability to query and analyze large data sets stored in Hadoop files. When Hadoop is used as an ETL tool, it enables the storage of huge amounts of unstructured data and also scales up massively with time and as the inflow of data increases. Inputs coming in the raw format from various sources can be collected and transformed using Hadoop ETL. It is possible to create a daisy chain of Hadoop nodes to crunch a huge data set. The structured data can then be utilized for performing analytics, Reporting and deriving conclusions. When choosing a BI Strategy for your organization there are a few things you should consider: 1. Overall ease & cost of implementation: Gathering the right information from the data can be an expensive proposition. Choosing the right BI solution for your business needs will definitely impact the IT budgets if not thought over prudently. Another consideration to be made before choosing a solution is the ease with which it can be implemented and start giving returns. 2. Real Time Analysis vs. Batch Analysis: One needs to identify whether the data that is being gathered and stored needs to be analyzed in real time, using High touch queries or is whether the same can be processed in batches that can use a framework like Hadoop to process and produce results in an offline manner. 3. End User Ad hoc Analytics: Every enterprise today has the ability to determine and define their requirements and needs. Service providers with a reactive approach to those needs can only play a catch up role rather than acting as the innovators. It also gives a competitive advantage where in your smart customers have an option of gathering the data that they need. Based on the business needs, you have to weigh each of these challenges and identify the right BI strategy for yourself. The next section elaborates the various popular BI approaches that are generally being used for achieving BI implementations over large data. 8

9 BI implementation approaches over large data Approach 1: Analytics over Hadoop with MPP DWs Today, a lot of options are available in the market that allows the integration of MPP DWs with Hadoop. This is worth considering for users that have a large amount of data even after applying summarization over it. Using Hadoop for cleaning/transforming the data into a structured form allows them to load the data into any of the available options of MPP DWs. While the data is being uploaded, they can write UDFs to perform Database level analytics and then integrate the same with BI solutions using ODBC/JDBC connectivity for end user analytics and reporting. Figure 3 Analytics over Hadoop with MPP DW Also, using MPP DWs allows companies to deploy various performance enhancement techniques like index compression, materialized views, result set caching and I/O sharing. Alternatively, some of the MPP DWs may also provide organizations with a robust framework that supports MR jobs executions within their own clusters at MPP levels, providing them with two levels of parallel processing. This feature is really good for working with high touch queries and provides an excellent framework for end user ad hoc analytics. Approach 2: Indirect Analytics over Hadoop Another interesting approach could be to use Hadoop for cleaning/transforming the data into a structured form and then loading the same into the RDBMS 9

10 databases. Hadoop can efficiently access the data between the RDBMS data sources and Hadoop systems through DBInputFormat and DBOutputFormat interfaces. Once the Unstructured Data is processed, it can be then pushed to an RDBMS database which can subsequently act as a data source for any BI solution. Figure 4 Indirect Analytics over Hadoop This approach provides the end user with the flexibility of parallel processing of Hadoop and an SQL interface at the summarized data level. This approach works well when the summarized data is not too big to pose a challenge for the RDBMS database being used. This solution is not as expensive as the first approach. It is also suitable for the high touch queries where the user wants to perform real time ad hoc analytics as most of the RDBMS databases come with a comprehensive set of performance enhancement techniques. 10

11 Approach 3: Direct Analytics over Hadoop We can also apply analytics directly over a Hadoop System without moving it to any RDBMS database. It can prove to be a very effective practice to analyze the data directly from the Hadoop file system. If we have a scenario where the processed & summarized data in itself is very huge, is placed on the Hadoop system. So, we do not want to get into the complications of moving the data out of the Hadoop system either to a MPP DW or RDBMS. This can be done by using Hive as an interface for the data present on Hadoop systems. Figure 5 Direct Analytics over Hadoop This approach allows you to do batch and asynchronous analytics over the same data present over the Hadoop system. It is a very cost effective approach as it does not involve any expense in managing the separate data source other than your existing Hadoop System. It also provides you with the flexibility of scaling to any level with your summarized data. The Large Data Processing and BI Strategy Matrix below can be used as a guideline to choose the right BI strategy for our business needs. 11

12 Figure 6 The Large Scale BI Strategy Case Study One of the customers of Intellicus, a global leader in digital marketing intelligence, was facing the challenge of processing and aggregating statistical advertising data of more than GBs. It wanted to use this data for mining behavioral insights to help its clients better understand their own customers and leverage and profit from the rapidly evolving World Wide Web and mobile arena. The problem was just not limited to creating a storage system for this data but also running analytics over it on a monthly basis. The intent was to extract specific patterns and then aggregate them based on complex parameters and weighing systems. Impetus quickly realized that for deriving intelligence from data of this magnitude, a conventional relational database management system would be inadequate in the long run. We therefore developed and deployed an optimum Hadoop Java implementation of the product, tuning the Hadoop infrastructure configuration to get the maximum throughput. Intellicus also used MapReduce and Tokyocabinet, which is a Berkley like file storage system to handle the 12

13 processing of a large metadata (>1.2 GB). The Java based implementation helped Intellicus achieve optimum results in the given small cluster. Summary Impetus believes that organizations have to choose the solutions that best suit their needs, to solve their Large Data challenges. At the same time, they also have to select a BI strategy based on their business requirements. Nevertheless, it is desirable to get all the possible options available under the same hood, as it will help in reducing the complications that arise when dealing with multiple alternatives to achieve a common goal. According to Impetus, the ideal solution will be to provide easy interfacing of the data present on Hadoop, the RDBMS and MPP DWs, and create a common platform for doing BI and analytics over that data. When it comes to batch processing, having out of thebox scheduling and asynchronous execution of MR Jobs and BI analytics components like dashboards, Reports, is critical. About Impetus Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver cutting edge software products. Our expertise spans the domains of Data Analytics, Large Data Management, SaaS, Cloud Computing, Mobility Solutions, Testing, Performance Engineering, and Social Media among others. Impetus Technologies, Inc Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA Tel: inquiry@impetus.com Regional Development Centers INDIA: New Delhi Indore Hyderabad Disclaimers The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc. 13

W H I T E P A P E R. Building your Big Data analytics strategy: Block-by-Block! Abstract

W H I T E P A P E R. Building your Big Data analytics strategy: Block-by-Block! Abstract W H I T E P A P E R Building your Big Data analytics strategy: Block-by-Block! Abstract In this white paper, Impetus discusses how you can handle Big Data problems. It talks about how analytics on Big

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Business Intelligence for Big Data

Business Intelligence for Big Data Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Big Data for Investment Research Management

Big Data for Investment Research Management IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable

More information

Cost-Effective Business Intelligence with Red Hat and Open Source

Cost-Effective Business Intelligence with Red Hat and Open Source Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,

More information

Big Data Defined Introducing DataStack 3.0

Big Data Defined Introducing DataStack 3.0 Big Data Big Data Defined Introducing DataStack 3.0 Inside: Executive Summary... 1 Introduction... 2 Emergence of DataStack 3.0... 3 DataStack 1.0 to 2.0... 4 DataStack 2.0 Refined for Large Data & Analytics...

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc. Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has

More information

Trustworthiness of Big Data

Trustworthiness of Big Data Trustworthiness of Big Data International Journal of Computer Applications (0975 8887) Akhil Mittal Technical Test Lead Infosys Limited ABSTRACT Big data refers to large datasets that are challenging to

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

The 3 questions to ask yourself about BIG DATA

The 3 questions to ask yourself about BIG DATA The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.

More information

Agile Business Intelligence Data Lake Architecture

Agile Business Intelligence Data Lake Architecture Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step

More information

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and

More information

The Rise of Industrial Big Data

The Rise of Industrial Big Data GE Intelligent Platforms The Rise of Industrial Big Data Leveraging large time-series data sets to drive innovation, competitiveness and growth capitalizing on the big data opportunity The Rise of Industrial

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

Big Data at Cloud Scale

Big Data at Cloud Scale Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Traditional BI vs. Business Data Lake A comparison

Traditional BI vs. Business Data Lake A comparison Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses

More information

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM Using Big Data for Smarter Decision Making Colin White, BI Research July 2011 Sponsored by IBM USING BIG DATA FOR SMARTER DECISION MAKING To increase competitiveness, 83% of CIOs have visionary plans that

More information

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP Business Analytics for All Amsterdam - 2015 Value of Big Data is Being Recognized Executives beginning to see the path from data insights to revenue

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012 Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team rlancaster@orbitz.com @rob1lancaster Organizer of Chicago

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case

More information

Big Data for Investment Research Management

Big Data for Investment Research Management IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment firms turn big data into actionable research

More information

Apache Hadoop Patterns of Use

Apache Hadoop Patterns of Use Community Driven Apache Hadoop Apache Hadoop Patterns of Use April 2013 2013 Hortonworks Inc. http://www.hortonworks.com Big Data: Apache Hadoop Use Distilled There certainly is no shortage of hype when

More information

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Testing 3Vs (Volume, Variety and Velocity) of Big Data Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used

More information

Microsoft Analytics Platform System. Solution Brief

Microsoft Analytics Platform System. Solution Brief Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal

More information

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Three Open Blueprints For Big Data Success

Three Open Blueprints For Big Data Success White Paper: Three Open Blueprints For Big Data Success Featuring Pentaho s Open Data Integration Platform Inside: Leverage open framework and open source Kickstart your efforts with repeatable blueprints

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Big Data for the Rest of Us Technical White Paper

Big Data for the Rest of Us Technical White Paper Big Data for the Rest of Us Technical White Paper Treasure Data - Big Data for the Rest of Us 1 Introduction The importance of data warehousing and analytics has increased as companies seek to gain competitive

More information

Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved

Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Perspective Big Data Framework for Healthcare using Hadoop

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Big Data - Infrastructure Considerations

Big Data - Infrastructure Considerations April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014 5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Harnessing the power of advanced analytics with IBM Netezza

Harnessing the power of advanced analytics with IBM Netezza IBM Software Information Management White Paper Harnessing the power of advanced analytics with IBM Netezza How an appliance approach simplifies the use of advanced analytics Harnessing the power of advanced

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Big Data Can Drive the Business and IT to Evolve and Adapt

Big Data Can Drive the Business and IT to Evolve and Adapt Big Data Can Drive the Business and IT to Evolve and Adapt Ralph Kimball Associates 2013 Ralph Kimball Brussels 2013 Big Data Itself is Being Monetized Executives see the short path from data insights

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

1 Performance Moves to the Forefront for Data Warehouse Initiatives. 2 Real-Time Data Gets Real

1 Performance Moves to the Forefront for Data Warehouse Initiatives. 2 Real-Time Data Gets Real Top 10 Data Warehouse Trends for 2013 What are the most compelling trends in storage and data warehousing that motivate IT leaders to undertake new initiatives? Which ideas, solutions, and technologies

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies 1 Copyright 2011, Oracle and/or its affiliates. All rights Big Data, Advanced Analytics:

More information

WHITE PAPER. Harnessing the Power of Advanced Analytics How an appliance approach simplifies the use of advanced analytics

WHITE PAPER. Harnessing the Power of Advanced Analytics How an appliance approach simplifies the use of advanced analytics WHITE PAPER Harnessing the Power of Advanced How an appliance approach simplifies the use of advanced analytics Introduction The Netezza TwinFin i-class advanced analytics appliance pushes the limits of

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information

White Paper: What You Need To Know About Hadoop

White Paper: What You Need To Know About Hadoop CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS

BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS Megha Joshi Assistant Professor, ASM s Institute of Computer Studies, Pune, India Abstract: Industry is struggling to handle voluminous, complex, unstructured

More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,

More information

HDP Enabling the Modern Data Architecture

HDP Enabling the Modern Data Architecture HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,

More information

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

<Insert Picture Here> Oracle and/or Hadoop And what you need to know Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,

More information