Using Hadoop to Expand Data Warehousing



Similar documents
Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Big Data Technologies Compared June 2014

Building Your Big Data Team

Open source Google-style large scale data analysis with Hadoop

Big Data and Its Impact on the Data Warehousing Architecture

2009 Oracle Corporation 1

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Successfully Deploying Alternative Storage Architectures for Hadoop Gus Horn Iyer Venkatesan NetApp

Oracle Big Data SQL Technical Update

Hadoop: Embracing future hardware

Quantcast Petabyte Storage at Half Price with QFS!

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Cost-Effective Business Intelligence with Red Hat and Open Source

Oracle BI Roadmap & Visual Analyzer Ljiljana Perica, Oracle Business Solution Leader Ljiljana.perica@oracle.com

Hortonworks Data Platform Reference Architecture

IBM Netezza High Capacity Appliance

KNIME & Avira, or how I ve learned to love Big Data

Microsoft Analytics Platform System. Solution Brief

Inge Os Sales Consulting Manager Oracle Norway

Oracle Database 12c Plug In. Switch On. Get SMART.

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

SAP Real-time Data Platform. April 2013

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

Advanced In-Database Analytics

High Availability Databases based on Oracle 10g RAC on Linux

Building a Scalable Storage with InfiniBand

EMC BACKUP MEETS BIG DATA

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation

Big Fast Data Hadoop acceleration with Flash. June 2013

Hadoop Size does Hadoop Summit 2013

MapR Enterprise Edition & Enterprise Database Edition

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Parallel Data Warehouse

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Providing Self-Service, Life-cycle Management for Databases with VMware vfabric Data Director

Main Memory Data Warehouses

Hadoop on the Gordon Data Intensive Cluster

Real Life Performance of In-Memory Database Systems for BI

<Insert Picture Here> Big Data

SQL Server Business Intelligence on HP ProLiant DL785 Server

EMC Virtual Infrastructure for Microsoft Applications Data Center Solution

The Future of Data Management

Networking in the Hadoop Cluster

Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

HP Oracle Database Platform / Exadata Appliance Extreme Data Warehousing

Dell Microsoft Business Intelligence and Data Warehousing Reference Configuration Performance Results Phase III

Aaron Werman.

Advanced Big Data Analytics with R and Hadoop

The Inside Scoop on Hadoop

SQL Maestro and the ELT Paradigm Shift

Modernizing Your Data Warehouse for Hadoop

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Open source large scale distributed data management with Google s MapReduce and Bigtable

Using RDBMS, NoSQL or Hadoop?

HadoopTM Analytics DDN

Safe Harbor Statement

I/O Considerations in Big Data Analytics

Einsatzfelder von IBM PureData Systems und Ihre Vorteile.

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Cisco IT Hadoop Journey

PureSystems: Changing The Economics And Experience Of IT

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

The Pros and Cons of Data Warehouse Appliances

EMC/Greenplum Driving the Future of Data Warehousing and Analytics

Platfora Big Data Analytics

Bringing Big Data into the Enterprise

Beyond Conventional Data Warehousing. Florian Waas Greenplum Inc.

The Design and Implementation of the Zetta Storage Service. October 27, 2009

Big Data Analytics - Accelerated. stream-horizon.com

HDP Enabling the Modern Data Architecture

Virtual Desktop Infrastructure (VDI) Overview

Advances in Virtualization In Support of In-Memory Big Data Applications

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

A HIGH-PERFORMANCE, SCALABLE BIG DATA APPLIANCE LAURA CHU-VIAL, SENIOR PRODUCT MARKETING MANAGER JOACHIM RAHMFELD, VP FIELD ALLIANCES OF SAP

Virtualizing Apache Hadoop. June, 2012

Virtuoso and Database Scalability

APACHE HADOOP PLATFORM HARDWARE INFRASTRUCTURE SOLUTIONS

IBM PureData System for Transactions. Technical Deep Dive. Jonathan Rossi, PureSystems Specialist

James Serra Sr BI Architect

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Cisco Data Preparation

Il mondo dei DB Cambia : Tecnologie e opportunita`

Big Data Processing: Past, Present and Future

Protecting Big Data Data Protection Solutions for the Business Data Lake

SUN ORACLE DATABASE MACHINE

Siemens SAP Cloud - Infrastructure as a Service

Transcription:

Using Hadoop to Expand Data Warehousing Mike Peterson VP of Platforms and Data Architecture, Neustar Feb 28, 2013 1 Copyright Think Big Analytics and Neustar Inc.

Why do this? Transforming to an Information and Analytics Company We needed to capture ALL the data We had to do this and strive to keep the effective cost of a TB under $250 We decided to commit to Open Source 2 Copyright Neustar Inc.

The Original Business Case 100 TB of INCREMENTAL Data Storage 3 Year Cost Millions US $ $9.6 $9.6 $6.3 $0.2 Hadoop Big Data Netezza Teradata Oracle 3 Copyright Neustar Inc.

Big Data Warehouse Challenges Cost to store unstructured data Poor response time to changing BI needs Data Warehouse access for departments Goals Integrate unstructured data with EDW Predictive analytics based on data science Data Governance in place to manage access 4 Copyright Neustar Inc.

Technology Evolution Goals» New data platforms unlock innovation» Not just package implementation» More open source technology» Rethink assumptions» Increase technology skills» Focus data teams 5 Copyright Neustar Inc.

Working Together You need partners to succeed» Thought Leaders in how» Teach not just do» Work at the executive level» Expertise in Technologies» Trusted partner» Collaborative Teams» Open source leader» Invested in client success» Price/performance 6 Copyright Think Big Analytics and Neustar Inc.

QFS Our most recent addition» Benefits to QFS» HDFS best practice is 3X replication» QFS give equivalent protection from Data Loss with a 1.5X replication factor.» Neustar nearly doubled the retention period of our largest data source (from 1 year to 2 years) with limited integration effort and no incremental Hardware Cost» QFS has provided a 30% improvement in Write performance and ~ 10% improvement in Read performance over HDFS» QFS was based on the Kosmos File System (KFS)» Uses Reed- Salomon encoding which is effectively Software Raid» QFS creates 9 stripes and can reconstruct the data from any 6 stripes» Been using QFS under Hadoop since early December 2012. It has been very stable» Some trade offs» Need to implement with 10Gbit switching since QFS moves away from HDFS s localization of Data» Group permissions work significantly different than HDFS 7 Neustar Inc. Confidential

Our Architecture

Hadoop Cluster Q4 2012 Configuration:» 128 Data Nodes (SuperMicro Fat Twin) + Primary and Backup Name Node» Hortorworks Hadoop Distribution» Will utilize both HDFS and QFS (Quantcast File System)» QFS based on Open Source KFS (Kosmos File System)» QFS used initially for the largest data source (DNS queries ) Each Data node:» OS: centos 6.3» 2 sockets / 8 cores per socket» 2080 total cores» 64 GB memory» 8.3 TB total memory» 8-3TB SATA drives / node» 3 PB of raw storage» 1.5 PB of usable storage» 10 Gbit Ethernet 9 Copyright Neustar Inc.

Postgres Cluster Q4 2012 Configuration:» 9 Database servers (SuperMicro)» Customized Stato / Postgres (forked Postgres 9.0.3) Database nodes:» OS: centos 6.3» 2 sockets / 8 core per socket» 144 total cores» 128 GB memory / Node» 1 TB total memory» 16-3TB SATA drives per server» Raid 6» Total usable storage in cluster is 345 TB» LSI Cachecade controller for database acceleration» 2-500 MB SSD» IO improvement cut server count in half» 10Gbit Ethernet 10 Copyright Neustar Inc.

Open Source Successes Created a +1 Data Capture Platform with Hadoop» Initial investment of 500K Committed to the Hortonworks Suite» Staying true to open source Migrated Netezza Data Warehouse to Postgres» Eliminated $400K of annual support cost and $3 MM for a Tech Refresh Migrated Oracle Data Warehouse to Postgres» Eliminated 48 Oracle Licenses and associated support Bulk of ETL Migration was performed with Impetus (off-shore partner)» 7 years worth of code was migrated in 9 months» 10 Impetus resources 11 Copyright Neustar Inc.

System Scale» Query volume is ramping» 10,000 Map Reduce processes/day» Ingesting over 40B rows a day» 1.5TB with 7x compression» Storage utilization at 60%» Core utilization spikes when processing Machine Learning Algorithms» 100% capture of multiple large product data sets 12 Copyright Neustar Inc.

By the Numbers over last 2 years Thousands 5,000 4,000 3,000 2,000 1,000 Total Cost of Ownership Cost Savings: 3 to 4 MM 0 2012 2013 2014 2015 Open Source DB Hadoop Traditional 13 Copyright Neustar Inc.

Challenges and Lessons» Open Source is a commitment to your technologist» DevOps is fuzzy in practice» Commodity components break and aren t supported» Storage still isn t free» It takes more time to change attitudes than technology» It is ok to try some scary things (like QFS)» Data Science» All that said, we are having a lot of fun 14 Copyright Neustar Inc.