Data Solutions with Hadoop
Reducing Costs using Open Source Software

Aaryan Gupta
Darshil Shah
Mark Williams

Contact:
Aaryan Gupta agupta@it.db.com
Darshil Shah dshah@it.db.com
Mark Williams mwilliams@it.db.com

November 16, 2014
CONTENTS

EXECUTIVE SUMMARY
INTRODUCTION
PROBLEMS FACING CURRENT SYSTEM
ADVANTAGES OF HADOOP
COST BENEFIT ANALYSIS
CONCLUSION
REFERENCES
EXECUTIVE SUMMARY

With the advent of the internet and the subsequent rise of online banking, IT infrastructure has become the backbone of modern finance and banking systems. Deutsche Bank's current data warehouses are fragmented across many legacy systems that have been patched together over the past twenty years (du Preez, 2013). The volume of trading, operations, and finance data is expected to keep growing, and the legacy systems are struggling to handle it. It is also increasingly important to generate reports for risk assessments and audits from the most up-to-date information possible (Information for Success, 2012). Updating these systems in the near future will be crucial if Deutsche Bank is to remain competitive in the era of big data. Hadoop is open source data management software that increases data processing capacity without requiring data to be converted from legacy systems (Hadoop Deployment, 2013). Switching to Hadoop will reduce the cost of future database infrastructure investments and create a system that remains scalable in the long term (Kurth & Wendt, 2013).
INTRODUCTION

Currently, Deutsche Bank holds the largest share of the foreign exchange market at 20.96% of all transactions (FX Poll Results, 2009). The enormous amount of data generated by these transactions needs to be stored efficiently while remaining readily accessible for analysis. Our current legacy system runs on large mainframe servers that are very costly to expand and can barely keep up with the current level of data being generated (du Preez, 2013). We can continue to add servers to our legacy system, but this does not address the underlying lack of scalability.

Many Fortune 500 companies are beginning to adopt the open source software Hadoop as their data-warehousing platform (Information for Success, 2012). Hadoop provides a platform that is simpler and cheaper to expand and more cost effective for daily operations. Breaking the data up into smaller blocks enables the system to distribute it across cheaper commodity servers. Hadoop is also more flexible: it can store different types of data efficiently, and its built-in redundancy allows the system to tolerate hardware faults (Hadoop for Enterprise). Moving from the current legacy system to Hadoop would be a huge step forward for Deutsche Bank in data warehousing and processing while reducing long-term costs.
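The idea of breaking data into smaller blocks and spreading them over commodity servers can be illustrated with a short sketch. This is not Hadoop code: the block size, the node names, and the round-robin placement rule are assumptions chosen for readability, and a real Hadoop cluster uses much larger blocks and its own placement policy.

```python
from collections import defaultdict
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_round_robin(num_blocks: int, nodes: List[str]) -> Dict[str, List[int]]:
    """Assign block ids to nodes in turn (a hypothetical placement rule)."""
    if not nodes:
        raise ValueError("at least one node is required")
    placement = defaultdict(list)
    for block_id in range(num_blocks):
        placement[nodes[block_id % len(nodes)]].append(block_id)
    return dict(placement)


# Hypothetical example: 1000 bytes of data, 256-byte blocks, three servers.
data = b"x" * 1000
blocks = split_into_blocks(data, 256)
placement = assign_round_robin(len(blocks), ["node-a", "node-b", "node-c"])

print(len(blocks))   # number of blocks the data was cut into
print(placement)     # which node holds which block ids
```

Because no single server has to hold the whole data set, capacity can be grown by adding nodes to the list instead of replacing one large machine.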
ADVANTAGES OF HADOOP

There are many advantages to using Hadoop, but the four most relevant to Deutsche Bank's needs are its cost effectiveness, scalability, flexibility, and fault tolerance.

Cost Effective - With the current system, any upgrade requires a large investment and a lot of time to implement. Hadoop clusters are inexpensive because they run on open source software that can be downloaded from the Apache Hadoop distribution for free. A Hadoop cluster can be built from commodity servers, which removes the dependency on large server hardware and further reduces costs. It also enables parallel computing, which lowers the cost per terabyte of data processed (Hadoop for Enterprise).

Scalable - Cluster size is not a limiting factor: any number of nodes can be added, independent of the type of data being stored. The added processing power helps retrieve data more efficiently while reducing the cost of further expansion, as shown in Figure 1 (Hadoop for Enterprise).

Figure 1 (Integrating Hadoop)

Flexible - Hadoop is able to work with any schema, meaning it can handle any kind of data, structured or unstructured, from any source (Norris, 2013). The data can be joined and aggregated in many ways, making financial analysis and audits easier. This means Deutsche Bank's servers will be able to handle a variety of data, from operations to financial and trading data, on a single system that is designed for mixed data types (Hadoop for Enterprise).

Fault Tolerant - Hadoop processes data in parallel and replicates it to other nodes in the cluster. When a node fails, the system automatically redirects its work to another node that holds a copy of the data and continues processing with minimal interruption, so the failure of a single node does not cause data loss (Nemschoff, 2013).
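The replication and failover behavior described under Fault Tolerant can be sketched in a few lines. This is a simplified model under stated assumptions, not Hadoop's actual implementation: the node names, the rotation-based placement rule, and the helper functions are hypothetical, and the three copies per block were chosen to match the commonly used Hadoop default.

```python
from typing import Dict, List, Optional, Set


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Store each block on `replication` distinct nodes (simple rotation)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def pick_replica(placement: Dict[int, List[str]], block: int,
                 failed: Set[str]) -> Optional[str]:
    """Return the first healthy node holding the block, or None if all failed."""
    for node in placement[block]:
        if node not in failed:
            return node
    return None


def readable_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> Set[int]:
    """Blocks that still have at least one copy on a healthy node."""
    return {b for b in placement if pick_replica(placement, b, failed) is not None}


# Hypothetical cluster: four nodes, six blocks, three copies of each block.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(6, nodes, replication=3)

# One node fails: every block is still readable from another copy.
after_one_failure = readable_blocks(placement, {"node-b"})
fallback = pick_replica(placement, 1, {"node-b"})

# Three nodes fail at once: blocks stored only on those nodes are lost.
after_three_failures = readable_blocks(placement, {"node-a", "node-b", "node-c"})

print(sorted(after_one_failure), fallback, sorted(after_three_failures))
```

The sketch also shows the limit of the guarantee: data survives as long as at least one node holding a copy stays healthy, which is why the number of copies is a key design choice.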
CONCLUSION

Given the current problems facing Deutsche Bank's data infrastructure, it is critical to upgrade to a better and more efficient approach. Hadoop integrates data scattered over multiple servers into a single cluster and organizes it effectively by providing a consistent structure. Data has been lost from servers on multiple occasions, and Hadoop addresses this issue by providing automatic redundancy: because data is stored on several different nodes, it is protected against hardware failures. The investment needed is minimal since Hadoop is open source. Compared to the high data warehousing costs the company is facing, Hadoop is capable of reducing server operation costs by 70%.

The adoption of Hadoop as Deutsche Bank's data warehouse software will reduce costs in the short term as well as reducing the cost of further server expansions due to its scalability. Switching from our current legacy systems will also reduce the amount of time spent auditing and generating risk reports. These factors enable our employees to be more productive while reducing the long-term costs of infrastructure expansion and operation.
REFERENCES

Dasteel, J. (2012, June 1). Information for Success. Retrieved November 16, 2014, from http://www.oracle.com/us/solutions/datawarehousing/dw-referencebooklet-1705275.pdf

du Preez, D. (2013, February 12). Deutsche Bank: Big data plans held back by legacy systems. Retrieved November 16, 2014, from http://www.computerworlduk.com/news/applications/3425725/deutsche-bankbig-data-plans-held-back-by-legacy-systems/

Hadoop for Enterprise with IBM. (n.d.). Retrieved November 16, 2014, from http://www-01.ibm.com/software/data/infosphere/hadoop/enterprise.html

Integrating Hadoop into your Enterprise IT Environment. (2014, July 11). Retrieved November 16, 2014, from http://www.slideshare.net/maprtechnologies/integrating-hadoop-intoyour-enterprise-it-environment

Kurth, S., & Wendt, M. (2013). Hadoop Deployment Comparison Study. Retrieved November 16, 2014, from http://www.accenture.com/sitecollectiondocuments/pdf/accenture-hadoopdeployment-comparison-study.pdf

Nemschoff, M. (2013, December 20). Big Data: 5 Major Advantages of Hadoop. Retrieved November 16, 2014, from http://www.itproportal.com/2013/12/20/bigdata-5-major-advantages-of-hadoop/

Norris, J. (2013). Saving Millions through Data Warehouse Offloading to Hadoop. Retrieved November 16, 2014, from http://www.snia.org/sites/default/files2/abds2013/presentations/mainstage/jacknorris_saving_missions_hadoop.pdf

FX poll 2009: Euromoney's 31st annual FX survey. (2009, May 6). Retrieved November 16, 2014, from http://www.euromoney.com/article/2191629/whatsincluded-in-the-full-2009-fx-poll-results-press-release.html