The big data revolution

Friso van Vollenhoven (Xebia)

Enterprise NoSQL

Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing and processing large amounts of data. Although these technologies initially emerged from large web companies and social networking companies that have to deal with huge amounts of data on a daily basis, they are also becoming more relevant and interesting for enterprises. There are essentially three reasons for this:

- The volume of data that organizations gather has grown explosively in recent years;
- When dealing with large data volumes, several NoSQL technologies can be more cost effective than existing proprietary solutions;
- Some NoSQL technologies are better suited for large-scale data mining and business intelligence than the traditional relational databases that are widely used today.

In order to understand the advantages of NoSQL, we need to examine what these technologies do and how they solve problems differently from traditional data storage and processing solutions. Traditionally, database servers are large machines with lots of memory and processing power. If data volumes get too large for the machine, the solution is simply to add even more memory and processing power to cope with the increased load. But this scale-up approach has limits; a machine cannot be scaled up indefinitely.

One solution to this problem is to scale out instead of up. This means adding another database server and splitting the database into two parts. Scaling out, however, poses two new problems. First, relational databases were not built with scaling out in mind, and distributing database transactions over more than one machine has performance implications that are usually solved by using very fast, specialized and expensive network hardware. Second, scaling out with proprietary software is expensive: each additional machine adds licensing costs, which can be substantial.

What the web companies needed was a solution that could scale out to any number of machines using normal, inexpensive hardware and with low additional licensing costs. A solution found by many was to abandon the relational database model, use only standard commodity hardware and use only open source software, which has no licensing costs. The software that emerged from this strategy provides an affordable way to deal with extremely large datasets.
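To make the idea of splitting a database concrete, here is a small, purely illustrative sketch (in Java, not taken from the whitepaper) of the simplest scale-out technique: hash-based sharding, where each record is routed to one of several independent database nodes by hashing its key. The class name and shard URLs are hypothetical.

    // Purely illustrative sketch: route each record to one of several independent
    // database nodes ("shards") by hashing its key. Not from the original whitepaper.
    import java.util.List;

    public class ShardRouter {

        private final List<String> shardUrls; // e.g. JDBC URLs of the individual database servers

        public ShardRouter(List<String> shardUrls) {
            this.shardUrls = shardUrls;
        }

        // Pick the shard that owns a given record key by hashing it into a bucket.
        public String shardFor(String recordKey) {
            int bucket = (recordKey.hashCode() & 0x7fffffff) % shardUrls.size();
            return shardUrls.get(bucket);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(
                    List.of("jdbc:postgresql://db1/crm", "jdbc:postgresql://db2/crm"));
            System.out.println(router.shardFor("customer-42")); // prints the node holding this customer
        }
    }

The catch is visible immediately: any query or transaction that touches keys living on different shards must now be coordinated across machines, which is exactly the situation that relational databases and proprietary licensing make slow and expensive, as described above.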

[Figure: initial hardware purchase cost plotted against the number of cluster machines (1 to 10), comparing commodity servers with high-end hardware.]

With proprietary software, scaling out is very expensive. Besides the cost of the hardware itself, each additional piece of hardware also incurs additional software license fees. Moreover, scaling out may simply not be an option for many proprietary solutions. In contrast, with open source software and scaling out, the only variables are hardware purchase, rack space and power consumption.

The ability to work with large data volumes using NoSQL technologies not only solves an important technical problem, but also creates many new opportunities. Having the power to process and analyze all the data an organization owns in a reasonable amount of time creates an exciting possibility: the ability to extract information and conclusions that were previously out of reach due to technical limitations. For example, ten years ago it would have been unthinkable that a web search company would be able to accurately predict flu epidemics. Today, Google is capable of doing just that, based on data about search terms and their geographical distribution. Similarly, organizations can now use this technology to gain new insights from their data, such as customer behavior analytics, risk modeling, fraud detection or fine-grained marketing information.

Two techniques have traditionally been used when dealing with data volumes too large for a single machine: archiving and sampling. Archiving is the practice of retaining only the most recent data for live querying; older data is pushed to off-line storage where it cannot be accessed directly. Sampling is the practice of analyzing only a fraction of all data points in order to obtain a much smaller, hopefully representative set on which a conclusion is then based. These workarounds are becoming obsolete. Today, it is technically possible to analyze every single transaction, customer interaction, fraud case and mouse click on a website that occurred during the past several years, and to base conclusions on all of that data instead of just a sample. For example, organizations can now think about building customer behavioral models that predict when customers are about to move to a competitor, and then take preventive action. This is possible even if there are millions of customers across the globe. Such possibilities make NoSQL and big data analytics relevant for the enterprise, and not just a technological solution to a problem that only Google or Facebook have.

Case study: Hadoop and HBase at RIPE NCC
Processing hundreds of millions of records per day

The RIPE NCC is one of five Regional Internet Registries (RIRs) providing Internet resource allocations, registration services and coordination activities that support the operation of the Internet globally. It is responsible for allocating ranges of IP addresses to organizations. Another primary mission of the RIPE NCC is to provide high-quality, near real-time information about the operation of the Internet, such as active latency and routing measurements. These measurements produce hundreds of millions of data points per day, resulting in large volumes of data that must be stored and indexed continuously. RIPE NCC must be able to deal with such volumes of data and convert the data stream into valuable information in real time.

Together with Xebia, RIPE NCC realized a system capable of handling hundreds of millions of records per day and achieving sustained throughputs of more than 10,000 records per second from multiple data sources across the globe, using only open source software, commodity server hardware and standard network equipment. New data sources can be added at any time. Moreover, storage and processing capacity can easily be scaled out when required, even without downtime.

The solution deals with network measurement data that is collected from sources at data centers across the world and sent to a storage network at RIPE NCC's Amsterdam-based data center. From there, all data is moved onto a Hadoop cluster, where a series of Hadoop jobs perform further processing, primarily an extract-transform-load (ETL) step that pushes the data into a database used for querying from a web application.

The system at RIPE NCC deals with data collected from 15 measurement points around the world. Every eight hours it processes a batch of 160 million updates in 3.5 hours. In addition, every five minutes thousands of additional updates come in. On average around 4,000 updates per second are processed around the clock, with peaks of 12,000 updates per second. In total 25 TB of source data are indexed, and this amount is growing rapidly.
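The whitepaper contains no code, so the snippet below is a hedged sketch rather than RIPE NCC's actual implementation: it writes one processed measurement into HBase, the store named in the case-study title, using the standard HBase Java client API. The table name, column family and row-key layout are assumptions made for illustration only.

    // Minimal sketch, assuming a table "measurements" with column family "d".
    // Row key = measurement source + timestamp, so records from one source sort
    // together and can be scanned by time range. Not RIPE NCC's actual schema.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MeasurementWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("measurements"))) {

                long timestamp = System.currentTimeMillis();
                byte[] rowKey = Bytes.toBytes("probe-0001/" + timestamp);

                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("latency_ms"), Bytes.toBytes(42L));
                table.put(put);
            }
        }
    }

Because HBase splits a table into regions that are spread over the cluster, both the write path (thousands of updates per second) and the query side scale with the number of machines, and machines can be added later without changing client code like this.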

RIPE NCC deals with Internet measurement data, and Internet traffic roughly doubles every year, so scalability is a very important factor for this setup. Initially, RIPE NCC deployed a system backed by a relational database to analyze and store the measurement data. This system was only capable of keeping three months of history available for querying and searching. All older data had to be archived and was not directly accessible. Full data analysis was therefore not an option. By adopting big data-ready technology, RIPE NCC is now able to use the full potential of the data sets it owns. Not only is it possible to query and search all ten years of historical data, but ad hoc analysis of the full data set has also become a realistic option.

When the problem of high-volume ETL workloads involving vast amounts of source data is addressed, high-end computing hardware is traditionally the first solution that comes to mind. We often assume that it takes specialized networking and storage equipment and high-end databases to deal with thousands of records each second. Today, however, this assumption is no longer always correct. By using the open source technologies Hadoop and HBase, businesses can harness the power of the MapReduce parallel processing model, originally implemented by Google to build its enormous search indexes.

Whereas traditional batch processing systems depend on very fast machines to sequentially process large amounts of data in acceptable time, Hadoop MapReduce takes an entirely different approach: it splits the data into smaller pieces and processes all pieces in parallel. This parallelism is key to its success. It can lead to significantly improved processing times even on standard server hardware, since no individual piece needs to be processed at lightning speed as long as all the pieces are processed at the same time (a small example of such a job is sketched below). This is achieved by using clusters of machines running Hadoop.

Hadoop is designed to run on any number of commodity server machines in parallel, ranging from a few machines to hundreds or even thousands. Additional machines can easily be added while the cluster is running, and Hadoop will immediately start using the additional capacity as it becomes available. As a result, capacity can be expanded simply by adding machines. This scale-out approach makes it realistic for IT to accommodate the demand for expansion as it arises. In a traditional scale-up scenario, by contrast, capacity is usually expanded by replacing a very big machine with an even bigger one, making the existing machine redundant.

Hadoop is open source software with a significant installed base and many corporate users who run it in production and depend on it for daily operation. Setups ranging from five to five hundred machines are powering BI solutions for the banking and insurance sector, online retailers and government agencies. Besides the Hadoop open source community, professional support as well as user and administration training are available from several providers. Moreover, because the software is open source, you are never locked into a single vendor or supplier for support or licensing.
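To ground the "split the data and process the pieces in parallel" model described above, here is a minimal Hadoop MapReduce job written against the standard Java API. It is not one of the jobs from the case study: it assumes an input of plain text files with one timestamped record per line (e.g. "2012-01-31T14:05:09 ...") and merely counts records per hour, but it has the same map/reduce shape that real ETL jobs are built on.

    // Minimal MapReduce sketch (illustrative only): count records per hour from
    // timestamped text input. Hadoop runs many mappers in parallel, one per input split.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RecordsPerHour {

        public static class HourMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text hour = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String record = line.toString();
                if (record.length() >= 13) {
                    hour.set(record.substring(0, 13)); // e.g. "2012-01-31T14"
                    context.write(hour, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text hour, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable count : counts) {
                    total += count.get();
                }
                context.write(hour, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "records per hour");
            job.setJarByClass(RecordsPerHour.class);
            job.setMapperClass(HourMapper.class);
            job.setCombinerClass(SumReducer.class); // pre-aggregates counts on the map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop divides the input files into splits, runs a copy of the mapper on each split close to where that data is stored, and then routes all counts for the same hour to a reducer. Adding machines to the cluster simply gives the framework more splits it can work on at the same time, which is why a 160-million-record batch can be processed within a few hours on commodity hardware.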

Conclusion

Besides solving a technical problem that relational databases are often unable to tackle, NoSQL technology, and especially Apache Hadoop for data crunching, is creating new opportunities. Big data business intelligence has emerged as a new field. Traditional business intelligence is often database-driven, which limits investigation to the data that has been imported into existing databases or data warehouses. In most organizations, however, far more data is available. Historical data is usually archived and therefore inaccessible for querying, and audit trails and other sources of information are often never imported into databases in the first place. With Hadoop MapReduce, thanks to its highly parallelized operation, it is possible to harvest vast amounts of data in an acceptable time. This makes full data analysis a realistic option, which can lead to insights that drive new business. Hadoop enables enterprises to essentially skip the database and query large amounts of source data directly, using a cluster that processes large amounts of data in just minutes.

On the technical side, Hadoop and other NoSQL-related technologies have shown themselves to be production-ready, mature software solutions that are available for enterprise use today. With many companies offering professional services and support for the various NoSQL technologies, the risk of early adoption is manageable. Combined with the competitive advantage that can come from being an early adopter, NoSQL and big data analysis should be considered by any organization that acquires or owns substantial amounts of data of any kind.

Xebia Nederland b.v., Utrechtseweg 49, 1213 TL Hilversum, The Netherlands, Tel: +31(0)35-5381921, www.xebia.com, info@xebia.com