KNIME & Avira, or how I ve learned to love Big Data



Similar documents
How Companies are! Using Spark

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Oracle Big Data SQL Technical Update

Oracle Database 12c Plug In. Switch On. Get SMART.

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Big Data on Microsoft Platform

Open source Google-style large scale data analysis with Hadoop

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

The Future of Data Management

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Implement Hadoop jobs to extract business value from large and varied data sets

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

AtScale Intelligence Platform

Information Architecture

Big Data Are You Ready? Thomas Kyte

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Analytics on Spark &

Real Time Big Data Processing

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Testing Big data is one of the biggest

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

How To Handle Big Data With A Data Scientist

Using Hadoop to Expand Data Warehousing

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Native Connectivity to Big Data Sources in MSTR 10

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

HDP Enabling the Modern Data Architecture

Moving From Hadoop to Spark

Information Builders Mission & Value Proposition

Apache Hadoop: Past, Present, and Future

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

KNIME opens the Doors to Big Data. A Practical example of Integrating any Big Data Platform into KNIME

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Comprehensive Analytics on the Hortonworks Data Platform

Big Data Analytics Nokia

Building a data analytics platform with Hadoop, Python and R

ITG Software Engineering

HDP Hadoop From concept to deployment.

MapR: Best Solution for Customer Success

The Future of Data Management with Hadoop and the Enterprise Data Hub

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Connecting Hadoop with Oracle Database

Large scale processing using Hadoop. Ján Vaňo

BIG DATA What it is and how to use?

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Unified Big Data Processing with Apache Spark. Matei

Advanced Big Data Analytics with R and Hadoop

Prerequisites. Course Outline

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Parallel Data Warehouse

Cost-Effective Business Intelligence with Red Hat and Open Source

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Data Analyst Program- 0 to 100

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec

Big Data: Are You Ready? Kevin Lancaster

Turning Big Data into More Effective Customer Experiences. Experience the Difference with Lily Enterprise

SAP and Hortonworks Reference Architecture

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Splice Machine: SQL-on-Hadoop Evaluation Guide

Sentimental Analysis using Hadoop Phase 2: Week 2

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Microsoft Analytics Platform System. Solution Brief

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Oracle Big Data Essentials

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Open source large scale distributed data management with Google s MapReduce and Bigtable

Best Practices for Hadoop Data Analysis with Tableau

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data processing goes big

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Understanding the Value of In-Memory in the IT Landscape

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

Workshop on Hadoop with Big Data

Big data blue print for cloud architecture

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Dell In-Memory Appliance for Cloudera Enterprise

Big Data: What You Should Know. Mark Child Research Manager - Software IDC CEMA

Making Sense of Big Data in Insurance

Virtualizing Apache Hadoop. June, 2012

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Big Data Integration: A Buyer's Guide

The 4 Pillars of Technosoft s Big Data Practice

Actian Vortex Express 3.0

Why Big Data in the Cloud?

Transcription:

KNIME & Avira, or how I ve learned to love Big Data

Facts about Avira (AntiVir) 100 mio. customers Extreme Reliability 500 employees (Tettnang, San Francisco, Kuala Lumpur, Bucharest, Amsterdam) Company owner, Mr. Auerbach 13.02.2014, 06:58

Big Data: Why did Avira decided to invest? Data storage has grown significantly after 2000 Computation capacity has also risen sharply Avira 2011 3

Reason 1: Because the McKinsey prophets said it may bring us more money

Reason 2: Because we are sitting on the top of several dozens of terabytes of anonymized customer data. The only challenge was to link this data and converge towards a single customer profile

Reason 3: Because we want to perform customer profilling & next best offer marketing to increase our margin Business objectives Deliver the optimal price, features & messaging to each customer in order to maximize EBIT from sales of new and renewal licenses Protect long-term margins by making each offer timely and unrepeatable (e.g., unique to a specific customer, product, event) Learn the underlying mechanics of features and price-elasticity on the level of customer cells Enable business to optimize campaign portfolio (i.e., über-algorithm traffics campaigns) A/B testing in all customer touch-points Technology requirements Ability to link a specific offer in a specific touch-point to specific customer. Machine-learning over all design dimensions to continually improve performance of the application Causal reporting to maximize the learning effect in the organization from algorithmic approaches to marketing and automation Create customer, product and behavioral tables from Avira s raw data within the dev. environment Create machine-learning algorithms optimizing the offer (price, features) per user-session Implement the services on development platform and place in listening mode to train Setup of the Hadoop framework (HDFS & MapReduce) & Couchbase, KNIME & Impala

Reason 4: Because it s fun (isn t this beautiful?)

Data Campaign sources Data Processing configuration Campaign Execution & Reporting Our business/architectural vision Communication touchpoints Avira.com E-Mail Notifier IPM CCC /offline Enterprise CRM Sugar Salesforce Reporting (Enterprise BI) Service Bus Cognos MIS Product Recommender Marketing Campaign Offers Generation & Redemption Campaign Manager Customer ID & Authentication Hadoop MapReduce Job management framework HDFS ETL (real time + bulk) Data Analysis Knime R PIG Master Data Social Media Click-Stream Monitoring Marketing Server Logs Product data collecting Billing & Payments Avira 2011

Our Daily Data Website logs: 5.000.000 lines Installation logs: 2.200.000 lines InProduct Messaging: 43.000.000 lines Download/Updater logs: 2.000.000.000 lines 60GB of daily compressed data or around 2TB of monthly data

Finding the Right Tool Our data is: Unstructured; Messy Coming from all kinds of sources: log files, log tables, relational databases We want to: Gather and store historical data Process huge amounts of it Support both batch and real-time operations

Management Node Master Nodes 10Gbit Network Switch Worker Node 12 72 GB 30 TB Worker Nodes 60 360 GB 35 TB

Using The Right Tool

MapReduce Full control over how the data is processed Works on structured and unstructured data Good for very complex business logic You have to write Java code Restricted to the MapReduce programming model Some things are difficult to code (JOINs, custom sorting)

Hive You write SQL-like queries Great for ad-hoc queries, data exploration Very fast development Works only on structured data Gets ugly if the business logic is complex

Impala Like Hive but A LOT faster Runs directly in memory Delivers almost real-time results Limited to in-memory processing Unreliable

KNIME and Hadoop at Avira Jdbc: Hive Impala Userid Password Table + SQL Statement Row Count Limit = 100000 Access SQL Evaluate / Override

Knime for Understanding the Data MapReduce Hive / Impala KNIME

Use Crisp-DM! Some Big Data Practicalities Big Data Methods Never Needed Decision: Would Big Data methods add value? Big Data Methods Possibly useful Decision: Would Big Data methods add value? Big Data Methods Possibly useful

Why did we decided to go with KNIME? The dark side of the moon: a typical symptom for home-grown business applications. KNIME helped us to tie a knot for the multiple uncorrelated data points and create customer 360 tables

Why did we decided to go with KNIME? It helped us move from code based data mining towards workflow based analytics; Analytics for everyone, easy to explain to all management/company levels

Some results: using KNIME & Tableau we ve managed to perform forensycs and license outlier analysis Our notorious hacker friend working for the overall good of torrent visitors This is me using Avira in Sep 2013 Hi!

Running k-means in KNIME to identify relevant clusters for Germany by looking at their antivirus software update behaviour Graph of clusters by time of day % usage:

Running k-means in KNIME to identify relevant clusters for Germany by looking at their antivirus software update behaviour Graph of clusters by daily % usage

Using KNIME to identify the real License Renewal pattern of our customers The timing of renewing an Avira license in % Reason 6 = 12%

Using KNIME to do standardized reporting of our license renewal metrics

Next steps Identify unknown groups of customers by allowing the machine to find patterns in data for creating special association rules/product recommendations & next best offer; test & train in KNIME, real-time model execution in Couchbase; NBO via Email NBO via In App Messaging 28 Avira 2011

Thank you. radu.pastia@avira.com/ phil.winters@knime.com/ florin.veringa@avira.com