Big Data
Big Data? Definition #1: the Forrester Research definition of Big Data
Big Data? Definition #2: A quote from Tim O'Reilly brings it all home: "Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue."
Big Data? Definition #3: My Oversee.net Big Data definition: when the size of the data itself becomes part of the problem
Mobile, social, profiling, logs, transactional: petabytes of data
When unstructured and dispersed data cannot be processed using conventional methods
An essential component of a test-and-learn mindset: the ability to rapidly test ideas fundamentally changes the company's mindset and approach to innovation
Oversee.net: Leader in Online Performance Marketing and Leading Websites
Over 1,000,000 domain names owned; monetization of over 8,000,000 domain names
Dominant owner of premium properties
Large PPC and CPV network giving advertisers highly targeted access to 250 million unique visitors
Owned-and-operated lead generation sites in travel, retail, and consumer finance producing over 14 million high-converting leads
Strong brands in all marketplace segments: DomainSponsor, SnapNames, Moniker, LowFares, ShopWiki, CreditCards.org
Why the Big Data Evolution? System Logs
Started with syslog-ng
As our volume grew, it didn't scale
Resources were overwhelmed and data was lost
Why the Big Data Evolution? Data, Data, and More Data
280 million unique visitors/month
700 million landing pages/month
50 million ad clicks/month
3,000 requests/second
Daily raw data processing: 2 terabytes
Total data storage requirements: over 200 terabytes
We process data daily down to clicks, visits, searches, products, leads, emails, sessions, and profiles. If raw data were saved beyond 30 days, it would run into the petabytes!
Data Analysis at Oversee.net: Keyword and Lander Optimization
We can anticipate and promote seasonal keywords. For example, "weight loss" is a popular search term in January (New Year's resolutions). Categorizing our keyword data also allows us to make better lander selections: "weight loss" is in the Beauty category, which does best on the Gym and Scale landers (sketched below).
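A toy Python sketch of that mechanism (the category and lander mappings below are illustrative, not Oversee's actual data):

# Toy sketch: pick a landing page from a keyword's category, as in the
# "weight loss -> Beauty -> Gym/Scale landers" example above.
KEYWORD_CATEGORY = {"weight loss": "Beauty", "cheap flights": "Travel"}

# Best-performing landers per category, e.g. from historical conversion tests.
BEST_LANDERS = {"Beauty": ["gym", "scale"], "Travel": ["lowfares"]}

def pick_lander(keyword):
    category = KEYWORD_CATEGORY.get(keyword.lower(), "default")
    return BEST_LANDERS.get(category, ["generic"])[0]

print(pick_lander("Weight Loss"))  # -> 'gym'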
Business Intelligence Tools
In an ever-changing Internet environment, we track and anticipate industry trends, such as shifts in browser use and the growth of mobile traffic.
Emerging markets overseas sparked an initiative to focus on the major countries of Europe.
New browsers bring new features, which may impact the effectiveness of our product.
Mapping User Input to Locations Using Explicit Semantic Analysis
Develop a location recommendation system based on semantic attributes
The system should be open-ended, i.e., impose minimal restrictions on user input
Allow the system to update itself over time
Be language independent
Inspiration: ESA
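ESA represents a piece of text as a weighted vector of "concepts" built from a reference corpus (classically, Wikipedia articles) and compares texts by the similarity of their concept vectors. A minimal Python sketch of that idea applied to location recommendation; the concept corpus and location set are toy placeholders, and simple term overlap stands in for the TF-IDF weighting real ESA uses:

import math
from collections import Counter

# Hypothetical concept corpus: concept name -> descriptive text.
CONCEPTS = {
    "beach":  "sand surf ocean sun swimming resort tropical",
    "skiing": "snow mountain slopes lift powder alpine winter",
    "city":   "museum nightlife restaurants shopping theater urban",
}

# Hypothetical locations described in the same vocabulary.
LOCATIONS = {
    "Cancun": "tropical beach resort ocean sun",
    "Aspen":  "alpine skiing snow mountain winter",
    "Paris":  "museum restaurants shopping theater",
}

def concept_vector(text):
    """Score text against every concept by term overlap (a stand-in
    for the TF-IDF weighting real ESA uses)."""
    words = Counter(text.lower().split())
    return {c: sum(words[w] for w in desc.split())
            for c, desc in CONCEPTS.items()}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_input, top_n=2):
    # Map free-form input into concept space, then rank locations by
    # how close their concept vectors are to the input's.
    uvec = concept_vector(user_input)
    scored = [(cosine(uvec, concept_vector(desc)), loc)
              for loc, desc in LOCATIONS.items()]
    return [loc for _, loc in sorted(scored, reverse=True)[:top_n]]

print(recommend("warm ocean swimming vacation"))  # e.g. ['Cancun', ...]

Because the matching happens in concept space rather than on raw strings, the approach stays open-ended about user input and, given a reference corpus per language, language independent.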
Data Types
Unstructured data: web crawl output, textual documents, bitmap data, blog posts, emails, images
(Semi-)structured data: web logs, data models, XML files, software code
Typical database teams think of structured data only. Today's reality forces us to deal with a lot of unstructured data.
Netezza: Multi-terabyte Storage
Over forty terabytes of storage capacity readily available for fast query response
Our production system crunches on average 92,000 queries/day
Queries computing unique counts across multiple years can run from the raw facts and return in under a minute (no need to build multiple aggregate tables; illustrated below)
Near real-time reporting due to a fast processing window
Our system supports both EDW back-end processing and external online user support; most online queries (covering multiple months) return within seconds
Very limited DBA/system support required (less than one DBA for routine support)
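As a rough illustration, a Python-over-ODBC sketch of that kind of multi-year unique-count query against the raw facts; the DSN, credentials, table, and column names are hypothetical (Netezza also exposes JDBC):

import pyodbc  # assumes the Netezza ODBC driver and DSN are configured

conn = pyodbc.connect("DSN=NETEZZA_PROD;UID=analyst;PWD=...")
cur = conn.cursor()
cur.execute("""
    SELECT EXTRACT(YEAR FROM click_ts) AS yr,
           COUNT(DISTINCT visitor_id)  AS uniques
    FROM   click_facts                 -- raw facts, no aggregate tables
    WHERE  click_ts >= '2008-01-01'
    GROUP  BY 1
    ORDER  BY 1
""")
for yr, uniques in cur.fetchall():
    print(yr, uniques)
conn.close()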
Data Flow Architecture at Oversee.net (legacy): Internet traffic hits the web servers; logs land on filers and flow through initial Perl log processing into a legacy MySQL database for additional processing, then into a Netezza 10200 (25 TB) with replication to a Netezza TF6 (16 TB). A Turing/Scribe mid-tier feeds a stats DB; a dimensionalization model structures the data for reporting, with additional post-processing building aggregate tables for sub-second publisher stats. Consumers: publisher reporting, the sales reporting tool, analytics, and direct querying, via application access and direct user access.
Data Flow Architecture at Oversee.net (interim): the same web server, filer, and Perl pipeline, but with complete log processing feeding the Netezza 10200 (25 TB, replicated to the Netezza TF6, 16 TB) directly, alongside the Turing/Scribe mid-tier. MicroStrategy reporting supplies canned reports, dashboards, and ad hoc access for customer reporting, analytics, and direct querying. Change: remove legacy MySQL processing, introduce MicroStrategy BI tools.
Data Flow Architecture at Oversee.net (target): log collection and processing via the Turing/Scribe mid-tier lands data in Hadoop HDFS with a Hive database on top; dimensional data migrates into Netezza for fast querying. Hive and Pig access supports advanced analytics, direct Netezza querying remains, and MicroStrategy supplies canned reports, dashboards, and ad hoc access for publisher reporting and analytics. Change: introduce a Hadoop cluster to handle future data volume growth (processing and storage) and to support advanced analytics; maintain the Netezza appliances for fast querying.
Why Hadoop and Hive?
Hadoop as technical opportunity:
Store data as-is, in log files etc., without first loading it into a database structure, and still be able to readily analyze it
Parallelize without having to manage the details of parallelization
Can optionally use available, heterogeneous hardware
Hadoop as business opportunity:
Leverage data (especially historical data) by using more data than is in Netezza alone: full user-agent strings, keyword impressions, Turing internals
Conduct additional business analysis, e.g., on keyword impressions
Hive:
Simplifies managing and querying structured data, built on top of Hadoop
Automates the MapReduce code (see the query sketch after this slide)
HDFS for storage; metadata in an RDBMS
Allows analysts easier access to data
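To make the "automates the MapReduce code" point concrete, a minimal sketch assuming the stock hive CLI and a hypothetical keyword_impressions table; Hive compiles the GROUP BY into map and reduce stages itself, so nobody writes that job by hand:

import subprocess

hql = """
    SELECT keyword, COUNT(*) AS impressions
    FROM   keyword_impressions          -- external table over raw logs
    WHERE  dt = '2011-06-01'
    GROUP  BY keyword
    ORDER  BY impressions DESC
    LIMIT  100;
"""
# The hive CLI ships with the Hive distribution; -e runs an inline query.
subprocess.run(["hive", "-e", hql], check=True)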
Why Hadoop and Hive? (continued)
Data locality: Hadoop makes use of this principle (unlike SGE+Lustre). Hadoop tries to send the computation to where the data lives, since it is easier to send a program than to move big data. This minimizes network calls and speeds up computation.
Redundancy: Hadoop breaks data into 64 MB blocks. Every block is copied and distributed across the file system (HDFS), so if a disk fails, the data is not lost. One can also duplicate jobs on the cluster and rely on another job's results if one fails.
Flexible API: the MapReduce API is quite simple to use and well abstracted. For instance, programmers can simply call a read-data function without having to know what happens behind the scenes (see the streaming sketch below).
A fully fledged framework: large community support and a big, growing ecosystem (HBase, Hive, Pig, ZooKeeper, Avro, Chukwa)
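A minimal sense of how simple the API can be: with Hadoop Streaming, the mapper and reducer are tiny scripts reading stdin and writing stdout. The Python sketch below counts clicks per domain from a hypothetical tab-delimited log layout (not one of Oversee's actual jobs):

#!/usr/bin/env python
# Hadoop Streaming sketch: count clicks per domain. Hadoop splits the
# input into blocks, schedules the mapper next to the data, and re-runs
# tasks if a node fails.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:                    # hypothetical layout:
            print("%s\t1" % fields[1])         # column 1 = domain

def reducer():
    current, total = None, 0
    for line in sys.stdin:                     # input arrives sorted by key
        key, count = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)()

The job would be submitted with the hadoop-streaming jar, passing this script as both the -mapper and -reducer commands; block placement, scheduling, and retries are all Hadoop's problem, not the programmer's.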
Hadoop / Hive Architecture
Offers highly scalable and fault-tolerant processing of very large data sets
MapReduce with Hadoop runs on top of HDFS (the Hadoop Distributed File System)
The ecosystem's random-access store, HBase, adds:
Query predicate push-down via server-side scan and get filters
Optimizations for real-time queries
Support for row locks (built using ZooKeeper)
A high-performance Thrift gateway; its HTTP interface supports XML, Protobuf, and binary encodings
Cascading, Hive, and Pig source and sink modules
A JRuby-based (JIRB) shell
No single point of failure
Rolling restart for configuration changes and minor upgrades
Random-access performance comparable to MySQL
http://hadoop.apache.org/hive
Integration Techniques
Elephant Bird: our binary logs are formatted as protocol-buffer messages; Elephant Bird is a library that generates Hadoop and Hive bindings to read these messages
Mahout: a library of machine learning and data mining algorithms built on Hadoop, used for domain/user categorization and travel recommendations
Netezza: Sqoop and MicroStrategy connectors into Hadoop (see the import sketch below)
Cloudera: providing support for the implementation
MapReduce using Perl: the HadoopThriftServer class is part of the /usr/lib/hadoop/contrib/thriftfs/hadoop-thriftfs-0.20.2+737.jar file
Profiling: daily aggregates available for 60 days in raw form for processing to allow optimization; stored in Hadoop, broken into fragments to allow for pattern-recognition algorithms
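A hedged sketch of what a Sqoop import from Netezza into HDFS might look like, invoked from Python; the host, port, database, table, credentials, and paths are all placeholders:

import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:netezza://nz-host:5480/PROD",
    "--username", "etl", "--password-file", "/user/etl/.nzpass",
    "--table", "CLICK_FACTS",
    "--target-dir", "/data/netezza/click_facts",
    "--num-mappers", "8",              # parallel slices of the import
], check=True)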
Safeguarding Your Data
Multiple levels of redundancy across all systems (Netezza and Hadoop)
Back up mission-critical data
NameNode: work is being done to safeguard the NameNode, since losing the NameNode means losing data
For the security of the Hadoop cluster, encrypt the data using single-key block encryption, and use transport encryption for HDFS (a sketch follows below)
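A minimal sketch of the single-key half of that advice, assuming the third-party Python cryptography package and the stock hadoop CLI; the file names and HDFS path are hypothetical:

from cryptography.fernet import Fernet  # AES-based, single symmetric key
import subprocess

key = Fernet.generate_key()        # in practice: load from a secured keystore
cipher = Fernet(key)

# Encrypt the raw log before it ever leaves the collection host.
with open("clicks.log", "rb") as f:
    ciphertext = cipher.encrypt(f.read())

with open("clicks.log.enc", "wb") as f:
    f.write(ciphertext)

# Ship the encrypted file into HDFS (hadoop CLI assumed on PATH).
subprocess.run(["hadoop", "fs", "-put", "clicks.log.enc", "/secure/logs/"],
               check=True)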
Future Data Efforts at Oversee
Dwell time / repeat visit reporting at the session and visitor level
Full behavioral profiling integrated with demographics
Full machine learning integration with both domain and keyword categorization
Traffic channel integration with profiling
Social profiling integrated to understand traffic
Real-time profiling integrated with ad placement
Questions?
Debra Domeyer
Co-President and Chief Technology Officer, Oversee.net
515 S. Flower St., Suite 4400, Los Angeles, CA 90071
Main: 213.408.0080
ddomeyer@oversee.net
www.oversee.net