Data Solutions with Hadoop




Reducing Costs using Open Source Software

Aaryan Gupta (agupta@it.db.com)
Darshil Shah (dshah@it.db.com)
Mark Williams (mwilliams@it.db.com)

November 16, 2014

CONTENTS

EXECUTIVE SUMMARY
INTRODUCTION
PROBLEMS FACING CURRENT SYSTEM
ADVANTAGES OF HADOOP
COST BENEFIT ANALYSIS
CONCLUSION
REFERENCES

EXECUTIVE SUMMARY

With the advent of the internet and the subsequent rise of online banking, IT infrastructure has become the backbone of modern finance and banking. Deutsche Bank's current data warehouses are fragmented across many different legacy systems that have been patched together over the past twenty years (du Preez, 2013). The volume of trading, operations, and finance data being created is expected to keep growing, and the legacy systems are struggling to handle it. It is also increasingly important to be able to generate reports for risk assessments and audits using the most up-to-date information possible (Information for Success, 2012). Updating these systems in the near future will be crucial if Deutsche Bank is to remain competitive in the era of big data. Hadoop is open-source data management software that increases data processing capacity without requiring data to be converted from legacy systems (Kurth & Wendt, 2013). Switching to Hadoop will reduce the cost of future database infrastructure investments and create a system that remains scalable over the long term (Kurth & Wendt, 2013).

INTRODUCTION

Deutsche Bank currently holds the largest share of the foreign exchange market, at 20.96% of all transactions (FX Poll Results, 2009). The enormous amount of data generated by these transactions needs to be stored efficiently while remaining readily accessible for analysis. Our current legacy system runs on large mainframe servers that are very costly to expand and are barely able to keep up with the current rate of data generation (du Preez, 2013). We can continue to add servers to the legacy system, but this does not address the underlying issue: the current system does not scale. Many Fortune 500 companies are beginning to adopt the open-source software Hadoop as their data-warehousing platform (Information for Success, 2012). Hadoop provides a platform that is simpler and cheaper to expand and far more cost-effective for daily operations. By breaking data into smaller blocks, Hadoop can distribute it across inexpensive commodity servers. Hadoop also creates a much more flexible system that can handle different types of data, store them efficiently, and tolerate hardware failures through built-in redundancy (Hadoop for Enterprise). Moving from the current legacy system to Hadoop would be a major step forward for Deutsche Bank in data warehousing and processing while reducing long-term costs.
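To illustrate how distributing work in this way proceeds, the following is a minimal, single-machine sketch of the map/reduce pattern that Hadoop applies across a cluster. The record format and totals here are hypothetical, chosen only to show the idea of aggregating transaction volume per currency pair; they are not Deutsche Bank data.

```python
from collections import defaultdict

# Hypothetical transaction records: "currency pair,notional amount".
records = [
    "EUR/USD,1000000", "EUR/USD,250000", "USD/JPY,500000",
    "EUR/USD,750000", "USD/JPY,125000",
]

def map_phase(record):
    """Map step: turn each raw record into a (key, value) pair.

    In Hadoop, this function would run in parallel on whichever
    nodes hold the data blocks containing these records.
    """
    pair, amount = record.split(",")
    return (pair, int(amount))

def reduce_phase(pairs):
    """Reduce step: aggregate all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

volumes = reduce_phase(map_phase(r) for r in records)
print(volumes)  # total traded volume per currency pair
```

In a real Hadoop deployment the map and reduce steps run on different nodes, with the framework handling data movement between them; the logic per record is no more complicated than this.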

ADVANTAGES OF HADOOP

There are many advantages to using Hadoop, but the four most relevant to Deutsche Bank's needs are its cost effectiveness, scalability, flexibility, and fault tolerance.

Cost Effective - With the current system, any upgrade requires a large investment and a lot of time to implement. Hadoop clusters are inexpensive because they run on open-source software that can be downloaded from the Apache Hadoop distribution for free. A Hadoop cluster can be built from commodity servers, which removes the dependency on large, specialized server hardware and further reduces costs. Hadoop also enables parallel computing, which decreases the cost per terabyte of processing data (Hadoop for Enterprise).

Scalable - Cluster size is no longer a constraint: any number of nodes can be added, independent of the type of data stored. The added processing power helps retrieve data more efficiently while reducing the cost of further expansion, as shown in Figure 1 (Hadoop for Enterprise).

Figure 1 (Integrating Hadoop)

Flexible - Hadoop can work with any schema, meaning it can handle any kind of data, structured or unstructured, from any source (Norris, 2013). The data can be joined and aggregated in many ways, making financial analysis and audits easier. This means Deutsche Bank's servers will be able to deal with a variety of data, from operations to financial and trading data, all on a single system designed to handle various data types (Hadoop for Enterprise).

Fault Tolerant - Hadoop replicates data to other nodes in the cluster. When any node fails, the system automatically redirects its work to another node and continues processing without delay, so no data is lost due to node failure (Nemschoff, 2013).
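As a sketch of how this redundancy is configured in practice, the number of copies HDFS keeps of each data block is a single setting in the cluster's hdfs-site.xml; three replicas is the Hadoop default, meaning any one node (or even two) can fail without data loss. The fragment below is illustrative, not a complete cluster configuration.

```xml
<!-- hdfs-site.xml: keep three copies of every data block on
     separate nodes, so a single node failure loses no data. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```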

CONCLUSION

Given the problems facing Deutsche Bank's data infrastructure, it is critical to upgrade to a better and more efficient platform. Hadoop integrates data scattered over multiple servers into a single cluster and organizes it effectively by providing a consistent structure. Hadoop also addresses the multiple instances in which data has been lost from servers: because data is automatically replicated across different nodes, it is protected against hardware failures. The investment needed is minimal since Hadoop is open source. Compared with the high data warehousing costs the company currently faces, Hadoop is capable of reducing server operation costs by 70%. Adopting Hadoop as Deutsche Bank's data warehouse software will reduce costs in the short term as well as reduce the cost of further server expansion thanks to its scalability. Switching from our current legacy systems will also reduce the time spent auditing and generating risk reports. These factors will make our employees more productive while reducing the long-term costs of infrastructure expansion and operation.

REFERENCES

Dasteel, J. (2012, June 1). Information for Success. Retrieved November 16, 2014, from http://www.oracle.com/us/solutions/datawarehousing/dw-referencebooklet-1705275.pdf

du Preez, D. (2013, February 12). Deutsche Bank: Big data plans held back by legacy systems. Retrieved November 16, 2014, from http://www.computerworlduk.com/news/applications/3425725/deutsche-bankbig-data-plans-held-back-by-legacy-systems/

FX poll 2009: Euromoney's 31st annual FX survey. (2009, May 6). Retrieved November 16, 2014, from http://www.euromoney.com/article/2191629/whatsincluded-in-the-full-2009-fx-poll-results-press-release.html

Hadoop for Enterprise with IBM. (n.d.). Retrieved November 16, 2014, from http://www-01.ibm.com/software/data/infosphere/hadoop/enterprise.html

Integrating Hadoop into your Enterprise IT Environment. (2014, July 11). Retrieved November 16, 2014, from http://www.slideshare.net/maprtechnologies/integrating-hadoop-intoyour-enterprise-it-environment

Kurth, S., & Wendt, M. (2013). Hadoop Deployment Comparison Study. Retrieved November 16, 2014, from http://www.accenture.com/sitecollectiondocuments/pdf/accenture-hadoopdeployment-comparison-study.pdf

Nemschoff, M. (2013, December 20). Big Data: 5 Major Advantages of Hadoop. Retrieved November 16, 2014, from http://www.itproportal.com/2013/12/20/bigdata-5-major-advantages-of-hadoop/

Norris, J. (2013). Saving Millions through Data Warehouse Offloading to Hadoop. Retrieved November 16, 2014, from http://www.snia.org/sites/default/files2/abds2013/presentations/mainstage/jacknorris_saving_missions_hadoop.pdf