So What's the Big Deal?




Presentation Agenda
- Introduction
- What is Big Data?
- So What is the Big Deal?
- Big Data Technologies
- Identifying Big Data Opportunities
- Conducting a Big Data Proof of Concept
- Big Data Case Study (if we have time)
- Q&A
- Links to More Information

Introduction
RhinoSource, Inc.
- Oracle App/Tech Consulting and Managed Services
  - Oracle E-Business Suite
  - Oracle Business Intelligence
  - Oracle Database (performance, partitioning and replication)
  - Application Development and Advanced PL/SQL Development
- Advanced Technology Consulting and Managed Services
  - Big Data
  - Mobile Applications
  - Cloud Computing
- CIO Level Advisory Services
  - IT Strategy, Planning and Project Management
  - ERP/CRM Evaluation and Implementation

WHAT IS BIG DATA?

What Makes Up Big Data?
- Blog posts, user comments
- Emails and messaging
- Web server logs
- Instrumentation of online stores
- Image and video uploads
- Process data, such as RFID
- Sensor device data
- External data sets
  - Census data
  - Weather data
  - Geographical data
- Shadow data (replicated copies and change journals)

The 3 V's of Big Data
- Velocity
  - Data generated at a faster rate than ever before
  - Server logs, smart phones, sensor devices, RFID
  - Challenge: Existing systems cannot process new data fast enough
- Variety
  - Data more varied and complex
  - Structured and unstructured
  - Many formats: text, document, image, video
  - Challenge: Existing databases do not handle varying data formats well
- Volume
  - Orders of magnitude larger
  - 2.5 Zettabytes of new data created in 2012
  - 8 Zettabytes of new data projected to be created in 2015
  - 3 Billion Internet users, 15 Billion connected devices
  - Challenge: Existing databases do not cost-effectively scale to Big Data sizes

Big Data Growth Trend (chart: data volume in zettabytes over time, 40% CAGR)
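To make the growth rate on this slide concrete: a 40% compound annual growth rate multiplies the data volume by 1.4 every year, so it roughly doubles every two years. The following is a minimal sketch of that arithmetic. The 1.0 ZB starting value is a hypothetical baseline chosen for illustration, not a figure from the chart.

```python
def project_growth(start_value, cagr, years):
    """Return the projected value for year 0..years under compound annual growth."""
    return [start_value * (1 + cagr) ** year for year in range(years + 1)]


# Hypothetical baseline of 1.0 ZB (an assumption); 0.40 is the CAGR on the slide.
projection = project_growth(start_value=1.0, cagr=0.40, years=5)

for year, zettabytes in enumerate(projection):
    print(f"year {year}: {zettabytes:.2f} ZB")
```

Swapping in a different baseline or rate only changes the two arguments; the compounding rule stays the same.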

How Much is 1 Zettabyte?

New Data by End of 2015: 17 ZB of new data by the end of 2015!

SO, WHAT'S THE BIG DEAL?

Can Big Data Show Us The Way?
Scientific American, Dec 11:
- "...the rise of 'big data' [is] a trend that is striking many scientists as being on a par with the invention of the telescope and microscope."
- "...many experts believe we are on the cusp of opening up new worlds of inquiry."

Big Data Advantages
- Better, more accurate predictions
- Deeper, richer insights into customers, business partners and operations
- Real-time Big Data analytics enables faster decision making
- Creates competitive advantage
- Improves bottom line

Big Data Spending
- Companies have spent $4.3 billion on Big Data as of the end of 2012.
- Gartner predicts those initial investments will in turn trigger a domino effect of upgrades and new initiatives, valued at $34 billion for 2013.
- Over a 5-year period, spend is estimated at $232 billion.

BIG DATA TECHNOLOGIES

A Brief History of Big Data (chart: scale over time; stages shown: RDBMS, RAC, Data Warehouse, Distributed Big Data Cluster)

How Do We Store Big Data?
- NoSQL databases store data records as key-value pairs
  - Or as triplets with a timestamp.
- Schema-less or schema-optional
  - Values may be structured or unstructured (developer's choice).
- Not relational
  - No relationships between records
  - No join support in a NoSQL database.
  - Does not use SQL to store and retrieve records.
- Highly optimized for retrieval and appending operations.
  - High performance writes.
  - High performance retrieval by primary key.
  - Little functionality beyond record storage and retrieval.
- Highly scalable to huge amounts of data
  - Millions or billions of records
  - Partition data across many distributed, inexpensive servers for cost-effective scalability and availability
- Must trade off Availability against Consistency (CAP Theorem).
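The storage model described above can be sketched in a few lines. This is a toy, in-memory illustration only: each record is a key, a value and a timestamp, and keys are spread across nodes by hashing. The class name, the node names and the last-write-wins rule are assumptions made for the example, and simple modulo hashing stands in for the more elaborate partitioning that real NoSQL products use.

```python
import hashlib
import time


class TinyKeyValueStore:
    """Toy store: each key maps to (value, timestamp); keys are hashed to nodes."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # One dictionary per node stands in for one inexpensive server.
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        """Pick the node that owns a key by hashing the key (modulo partitioning)."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value, timestamp=None):
        """Store a (key, value, timestamp) triplet; the newest timestamp wins."""
        ts = time.time() if timestamp is None else timestamp
        node = self.nodes[self.node_for(key)]
        current = node.get(key)
        if current is None or ts >= current[1]:
            node[key] = (value, ts)

    def get(self, key):
        """Retrieve by primary key only; there are no joins or secondary lookups."""
        record = self.nodes[self.node_for(key)].get(key)
        return None if record is None else record[0]


store = TinyKeyValueStore(["node-a", "node-b", "node-c"])
store.put("user:1", {"name": "Ann"}, timestamp=2)
store.put("user:1", {"name": "stale"}, timestamp=1)  # older write is ignored
store.put("log:42", "GET /index.html", timestamp=3)  # unstructured value
```

Note that the value can be a structured dictionary or a plain string; the store does not care, which is the schema-optional property from the slide.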

Popular NoSQL Databases
- Key-Value Stores
- Column-Oriented Databases
- Graph Databases
- Document Databases

Why Not Relational for Big Data?
- Transforming and loading data into an RDBMS requires extensive preprocessing of data into a pre-defined schema
  - Doesn't work well for semi-structured and unstructured data
  - Can take more time than is available before the next batch must be loaded
- Joining multiple data sets at query time is an expensive operation
- RDBMS scaling must be done vertically to larger and more expensive servers and storage solutions
- RDBMS clustering requires expensive networking and shared storage infrastructures
  - Fibre Channel, InfiniBand, SAN, NAS
- Challenging to distribute data across data centers
  - Replication strategies are add-ons and complex
- Strict consistency requirement is enforced at the cost of write performance and availability (CAP Theorem)

Dr. Brewer's CAP Theorem: Pick 2
- CA: Oracle RAC, RDBMS
- CP: BigTable, Hadoop/HBase, MongoDB, Oracle NoSQL, Redis
- AP: Cassandra, CouchDB, Dynamo, Riak, SimpleDB

Scalability Comparison (Logarithmic Scale) (chart: terabytes and server nodes for MongoDB, Oracle RAC, Cassandra and Hadoop)
- 10 TB, 100 nodes at Craigslist
- 71 TB, 48 nodes at Amazon
- 300 TB, 400 nodes at Digital Reasoning
- 21 PB, 2000 nodes at Facebook

Scalability Comparison (Linear Scale) (chart: terabytes and server nodes for MongoDB, Oracle RAC, Cassandra and Hadoop)
- 10 TB, 100 nodes at Craigslist
- 71 TB, 48 nodes at Amazon
- 300 TB, 400 nodes at Digital Reasoning
- 21 PB, 2000 nodes at Facebook

Feature Comparison
Cassandra
- Best of the NoSQLs for Cross Data Center Replication and High Availability
- Known to scale to 100s of Terabytes (but theoretically to Petabytes)
- Tunable Consistency at operation level for writes and reads. Availability model (AP).
- Primary and Secondary Indexes
- Queries are Real Time (CQL, Thrift)
- No Join Support
- Masterless Peer-to-Peer Ring Architecture = No S.P.O.F.
- Provides most cost-effective HA and scalability of the NoSQLs
- Written in Java
- Minimum of 3 nodes recommended. Easy to install and set up on commodity hardware.
Hadoop/HBase
- The current Gold Standard of the NoSQLs for Data Analysis
- Known to scale to Petabytes (1000s of Terabytes)
- Consistency model (CP)
- Hadoop Queries are Batch (MapReduce). HBase provides real-time queries similar to Cassandra.
- Joins are Possible
- Master-Slave Architecture = S.P.O.F. (NameNode/JobTracker node)
- Written in Java
- Minimum of 5 nodes recommended. More challenging installation and setup.
- Warm Standby and Shared Storage Required for High Availability Failover, so higher infrastructure costs.
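The "tunable consistency" point can be illustrated with the standard replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge the write plus the number consulted on the read exceeds the replication factor. The sketch below is plain arithmetic, not a call into any Cassandra driver; the function names are hypothetical, and only the ONE, QUORUM and ALL levels are modelled.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def is_strongly_consistent(write_level, read_level, replication_factor):
    """True when the read set and write set must overlap in at least one replica."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return writes + reads > replication_factor


# Replication factor 3 is an assumption chosen for illustration.
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    consistent = is_strongly_consistent(write_level, read_level, 3)
    print(f"write={write_level} read={read_level} -> overlap guaranteed: {consistent}")
```

Lower levels such as ONE/ONE favor availability and latency, which is the AP end of the trade-off; QUORUM/QUORUM gives up some availability in exchange for overlapping reads and writes.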

Best of All Worlds: DataStax Enterprise
- Cassandra: Real-Time Database
  - Peer-to-Peer HA Architecture
  - Cross Data Center Replication
  - Real-Time, Low-Latency Queries
- Hadoop: Analytics
  - Map/Reduce, Hive, Pig (Joins)
- Solr: Search
  - Full Text Search
  - Rich Document Handling (Word, PDF)

Plus Cluster Management

Current Big Data Challenges
- Integrating Big Data with existing databases and BI/reporting systems.
  - JDBC, ODBC
  - sqoop
- Security and Encryption
  - DataStax Enterprise 3.0 (In Beta)
    - Transparent Data Encryption
    - Internal and External Authentication
    - Data Auditing

IDENTIFYING BIG DATA OPPORTUNITIES

Big Data Use Cases
- Context for Interactions and Transactions
  - Reward Points
  - Warranty Policies
  - Social media chatter
  - Survey response feedback
  - Website requests
- Connection with Outside Patterns
  - Weather Data
  - Demographic Data
  - Geographical Data
  - Government Compliance Data
- Improving Disaster and Outage Response Times by Spotting Trends
- Compliance Checks and Audits
- Competitive Insights into How Your Products and Services (and your competition's) are used and perceived in the marketplace.
- Database Infrastructure Behind Mobile and Web Applications

Great Places to Look for Big Data Opportunities
- Server Logs
  - Web server and app server logs
  - Call center/phone system logs
- Product Data
  - Performance data
  - Sensor data
  - Positional data
  - Streamed live or captured in log files / data files
- Current RDBMS Archive/Purge Strategies
  - What data are you deleting every day/month/year?
  - Financial Data, Operational Data, Customer Interactions

Implementing Big Data
- Identify "Game Changing" Big Data opportunities.
- Define a business case.
- Identify existing business and functional capabilities.
- Augment existing capabilities with 3rd party assistance.
- Conduct a low-cost Proof of Concept project to demonstrate feasibility.

Low Cost Proof of Concept
- Take advantage of a cloud platform like Amazon Web Services (AWS) and Amazon EC2.
- Run a multi-node cluster for less than $25/day.
- Get started instantly. Have a cluster up and running in only a few hours.
- NoSQL technologies are well suited to the cloud deployment model.
- Amazon Machine Images (AMIs) exist for most NoSQL products and can be started in just a few minutes.
- You can make it as secure as you need it to be.

Low Cost Proof of Concept (continued)
Now that you have a cluster up and running:
- Load up some test data. (Check out sqoop.)
- Get your HiveQL book in hand and start doing some analysis.
- Delete the servers once you are done. Only pay for the time the servers are running.
- You can always bring the cluster in house for production, but you might find out it's more cost effective to leave it in the Cloud!
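Because the servers are billed only while they run, the proof-of-concept budget is simple arithmetic: nodes times hourly rate times hours. A minimal sketch follows. The $0.25 hourly rate and the 3-node cluster are hypothetical placeholders, not quoted AWS prices; the $25/day ceiling is the figure from the slides.

```python
def cluster_cost(nodes, hourly_rate_usd, hours):
    """Total cost of running a cluster, billed per node-hour."""
    return nodes * hourly_rate_usd * hours


def max_hourly_rate(daily_budget_usd, nodes, hours_per_day=24):
    """Highest per-node hourly rate that still fits the daily budget."""
    return daily_budget_usd / (nodes * hours_per_day)


# Hypothetical inputs: 3 nodes at $0.25 per node-hour (assumed, not an AWS price).
full_day = cluster_cost(nodes=3, hourly_rate_usd=0.25, hours=24)
working_day = cluster_cost(nodes=3, hourly_rate_usd=0.25, hours=8)
rate_ceiling = max_hourly_rate(daily_budget_usd=25.0, nodes=3)

print(f"24 hours: ${full_day:.2f}")
print(f" 8 hours: ${working_day:.2f}")
print(f"rate ceiling for $25/day on 3 nodes: ${rate_ceiling:.4f} per node-hour")
```

Shutting the cluster down outside working hours cuts the bill proportionally, which is the main reason a cloud-hosted proof of concept stays cheap.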

(If we have time) BIG DATA CASE STUDY

Client Overview
- Mobile social networking startup
- Focused on families with kids
- Launching in Q1 2013
- Currently in Stealth Mode pending launch the first week of March 2013
- Big Data Use Case: Infrastructure behind mobile app

The Challenge
- Big Data application
- Semi-structured and unstructured data
- Low latency (<100ms) for user experience
- 24 x 7 high availability
- Cloud deployment (Amazon AWS)
- Analytical capability required

The Solution
DataStax Enterprise Big Data Database Cluster
- Cassandra database for low-latency reads and writes
- Cluster architecture for high availability
- Tunable read and write consistency
- Integrated Hadoop workload support for analytics
- Integrated Solr workload support for search feature
- DataStax OpsCenter tool for cluster management
Benefits
- High performance reads and writes = good customer experience
- Only a single cluster required for Cassandra, Hadoop and Solr
- Commercial-grade support
- Cost-effective solution
- Fast deployment (30 days)

Technical Details
- Installed DataStax Enterprise 2.2.1 on Amazon AWS
  - 3 x M1.Large nodes
  - Will double to 6 nodes later in the year
  - Each node will hold ~800GB of data
- Implemented monitoring and alerts
  - Cluster stats collected every 15 seconds
  - Stats stored in db and graphed
  - Amazon SNS for notifications (email and SMS)
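The sizing figures above lend themselves to a quick back-of-the-envelope check. The sketch below uses the node counts, the ~800GB per node and the 15-second collection interval from the slide. The replication factor of 3 is an assumption for illustration, since the deck does not state the one actually used.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def raw_capacity_gb(nodes, gb_per_node):
    """Total data held across the cluster, counting every replica."""
    return nodes * gb_per_node


def unique_data_gb(raw_gb, replication_factor):
    """Distinct data held once replica copies are factored out."""
    return raw_gb / replication_factor


def samples_per_day(interval_seconds):
    """How many stats samples one metric produces per day."""
    return SECONDS_PER_DAY // interval_seconds


initial_raw = raw_capacity_gb(nodes=3, gb_per_node=800)
expanded_raw = raw_capacity_gb(nodes=6, gb_per_node=800)
# Replication factor 3 is assumed, not taken from the slides.
initial_unique = unique_data_gb(initial_raw, replication_factor=3)
daily_samples = samples_per_day(interval_seconds=15)

print(f"3 nodes: {initial_raw} GB raw, {initial_unique:.0f} GB unique at RF=3")
print(f"6 nodes: {expanded_raw} GB raw")
print(f"stats samples per metric per day: {daily_samples}")
```

The sample count matters for the monitoring database: every tracked metric on every node adds that many rows per day.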

Amazon AWS and EC2

OpsCenter Cluster Management

Cluster Ring View

Performance Monitoring

Customized Dashboards

More Custom Dashboards

Q&A

More Reading www.rhinosource.com/bigdata.html

Thank you!