Student Project 1 - Explorative Data Analysis with Hadoop and Spark




42matters is a rapidly growing start-up leading the development of next-generation mobile user modeling technology. Our solutions are used by big-brand companies in the mobile advertising market to serve mobile users intelligently targeted content. We are an international team with an innovative and fast-paced company culture.

Project Overview

The anonymized data collected about mobile devices needs to be used for different data analytics tasks. The data is stored in an online transaction processing system (referred to below as the online system), which is not suitable for this type of task. The goal of the project is to set up a system based on Hadoop/Spark that allows offline data analytics. The whole system will be implemented on Amazon AWS.

The main activities of the project are:

- Load data from the online system into Hadoop/Spark.
- Structure and prepare the data so that it is suitable for the required data analytics tasks.
- Implement and run the data analytics tasks.

The system will be built on a stack of MongoDB, Couchbase, and a Hadoop cluster running on the Amazon cloud. More details about the three parts are given in the following sections.

The figure below provides an overview of the systems involved in the project. On the left side is the online system, which stores all production data. Its data is stored in two different database systems, Couchbase and MongoDB. This part of the system will be provided. On the right side is the offline system, which needs to be implemented. Data from the online system has to be loaded into the offline system and analyzed there. The whole system will be created in Amazon AWS (user credentials for Amazon AWS will be provided by 42matters).

Data Sources Structure

The source data used in the project is data about mobile devices and about apps. Devices are stored in Couchbase, whereas apps are stored in MongoDB. Both devices and apps are stored in JSON format.

Apps

Apps are stored in a MongoDB collection (a collection in MongoDB corresponds to a table in a relational database). Each app is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- package_name: The unique identifier of an app.
- title: The title of an app.
- description: The description of an app.
- category: The Google Play category the app belongs to.
- 42category (optional): Similar to the field category, but more fine-grained.
- rating: The rating of the app on Google Play.

The following example represents the app document for the Facebook app:

    {
        "package_name": "com.facebook.katana",
        "title": "Facebook",
        "description": "Keeping up with friends is faster than ever..",
        "category": "Social",
        "42category": "Social Network",
        "rating": 4.0
    }

This collection contains about 1 million apps.

Devices

Devices are stored in Couchbase buckets (a bucket in Couchbase corresponds to a table in a relational database). Each device is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- udid: The unique identifier of a device.
- country: The country of the device.
- timestamp: Timestamp of the last update of the document.
- apps: List of apps installed on the device (apps are identified by their package name).

The following example represents a device which, among others, has the Facebook app installed:

    {
        "udid": "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
        "country": "CH",
        "timestamp": ISODate("2014-09-24T10:18:56.531Z"),
        "apps": [
            {
                "fit": ISODate("2013-10-12T03:47:39Z"),
                "lut": ISODate("2014-06-22T12:15:19Z"),
                "pn": "playboard.android"
            },
            {
                "fit": ISODate("2010-05-18T08:43:32Z"),
                "lut": ISODate("2014-09-24T10:11:46Z"),
                "pn": "com.facebook.katana"
            }
        ]
    }

This bucket contains millions of devices.
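The link between the two stores is the package name: each entry in a device's apps list carries a pn value that matches the package_name of an app document. A minimal Python sketch of that lookup follows. The helper name installed_app_titles and the in-memory dictionary are illustrative assumptions, not part of the 42matters system, and only the Facebook entry is taken from the examples above.

```python
# App documents indexed by package_name. Only the Facebook entry comes from
# the example in the text; a real index would be loaded from MongoDB.
apps_by_package = {
    "com.facebook.katana": {
        "title": "Facebook",
        "category": "Social",
        "42category": "Social Network",
        "rating": 4.0,
    },
}

# A trimmed version of the example device document (timestamps omitted).
device = {
    "udid": "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
    "country": "CH",
    "apps": [{"pn": "playboard.android"}, {"pn": "com.facebook.katana"}],
}


def installed_app_titles(device, apps_by_package):
    """Return (package_name, title) for each installed app.

    The title is None when the package is missing from the app index
    (hypothetical helper, for illustration only).
    """
    result = []
    for entry in device["apps"]:
        app = apps_by_package.get(entry["pn"])
        result.append((entry["pn"], app["title"] if app else None))
    return result


titles = installed_app_titles(device, apps_by_package)
```

Apps that appear on a device but not in the app index are kept with a missing title rather than dropped, so such gaps stay visible during exploration.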

Data Analytics Requirements

The system to be built will enable explorative data analytics, i.e. it will allow exploring device and app data by executing SQL-like queries (e.g. by using Hive over Spark or Spark SQL). Some examples of queries to be supported by the system follow.

- The number of devices having a specific app installed:

    Package name           Number of devices
    com.facebook.katana    1,850,300

- The number of existing apps per app category:

    Category    Number of apps
    Social      2,500
    Business    7,430

- The top 10 most installed apps in the countries CH, DE, US, IT.

- The cumulative count of app installations on devices based on the app rankings (also grouped by country):

    Country   Ranking Range   App Installations   Cumulative App Installations
    CH        (4.0, 5.0]      15,000,000          15,000,000
    CH        (3.0, 4.0]      10,000,000          25,000,000
    CH        (2.0, 3.0]       2,000,000          27,000,000
    CH        (1.0, 2.0]       1,000,000          28,000,000
    CH        [0.0, 1.0]         500,000          28,500,000
    DE        (4.0, 5.0]      75,000,000          75,000,000

- The percentage of the top 1000 apps (apps with the most installations on devices) per country that have a 42category. (Note: this query first requires identifying the top 1000 apps per country and then computing the percentage of them that have a 42category.)

- The average percentage of apps having a 42category, per device and country.

- The top 10 apps (apps with the most installations on devices) per country that do not have any 42category.

Project Tasks

The project requires several tasks to be accomplished:

Loading data from Couchbase and MongoDB into Hadoop
Hadoop connectors exist for both Couchbase and MongoDB. These connectors can be used to extract the data about devices and apps from the source systems and load it into Hadoop.

Data modeling
Data from the source systems needs to be modeled in Hadoop (e.g. as Hive tables) in a way that allows the above queries to be expressed. A challenging part of this task might be bringing the device documents into a tabular structure: each device document contains a list of the apps installed on that device, and this list can have a different length on each device.

Query writing
Based on the defined data model, queries need to be written to answer the data analytics requirements described in the previous section.

Challenges

- Understanding and using the technology stack
- Mastering the distributed model of Hadoop/Spark
- Mastering the SQL-like query language to accomplish the data analytics tasks

(Optional) Tableau Software Integration

Tableau Software is a tool for explorative data analysis. It can connect to different data sources and explore the data graphically. Tableau Software could use the Hadoop/Spark/Hive cluster as a data source, making it possible to explore the data in Hadoop graphically and to create dashboards.
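The data modeling and query writing tasks described above can be illustrated with a small, self-contained sketch. It flattens each device document into one row per installed app (in Hive, a comparable step is usually written with LATERAL VIEW explode) and then runs three of the required queries. SQLite from the Python standard library is used here purely as a stand-in for Hive or Spark SQL, and the window function requires SQLite 3.25 or newer. The table names, column names, the helper flatten_device, the short device IDs, and the category and rating given for playboard.android are all assumptions made for illustration.

```python
import sqlite3

# Toy stand-ins for the app and device documents. Apart from the two package
# names and the Facebook category/rating, all values are made up.
apps = [
    {"package_name": "com.facebook.katana", "category": "Social", "rating": 4.0},
    {"package_name": "playboard.android", "category": "Tools", "rating": 4.5},
]
devices = [
    {"udid": "d1", "country": "CH",
     "apps": [{"pn": "playboard.android"}, {"pn": "com.facebook.katana"}]},
    {"udid": "d2", "country": "CH", "apps": [{"pn": "com.facebook.katana"}]},
    {"udid": "d3", "country": "DE", "apps": [{"pn": "com.facebook.katana"}]},
    {"udid": "d4", "country": "DE", "apps": []},  # no apps, so no rows
]


def flatten_device(device):
    """Yield one (udid, country, package_name) row per installed app."""
    for app in device.get("apps", []):
        yield (device["udid"], device["country"], app["pn"])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (package_name TEXT, category TEXT, rating REAL)")
conn.execute("CREATE TABLE device_apps (udid TEXT, country TEXT, package_name TEXT)")
conn.executemany(
    "INSERT INTO apps VALUES (?, ?, ?)",
    [(a["package_name"], a["category"], a["rating"]) for a in apps],
)
conn.executemany(
    "INSERT INTO device_apps VALUES (?, ?, ?)",
    [row for d in devices for row in flatten_device(d)],
)

# Number of devices having a specific app installed.
devices_with_app = conn.execute(
    "SELECT COUNT(DISTINCT udid) FROM device_apps WHERE package_name = ?",
    ("com.facebook.katana",),
).fetchone()[0]

# Number of existing apps per category.
apps_per_category = dict(
    conn.execute("SELECT category, COUNT(*) FROM apps GROUP BY category")
)

# Installations per country and rating bucket, with a running total per country.
# Bucket 1 is (4.0, 5.0], bucket 2 is (3.0, 4.0], ..., bucket 5 is [0.0, 1.0].
cumulative = conn.execute("""
    WITH rated AS (
        SELECT d.country AS country,
               CASE WHEN a.rating > 4.0 THEN 1
                    WHEN a.rating > 3.0 THEN 2
                    WHEN a.rating > 2.0 THEN 3
                    WHEN a.rating > 1.0 THEN 4
                    ELSE 5 END AS bucket
        FROM device_apps d
        JOIN apps a ON a.package_name = d.package_name
    ),
    bucketed AS (
        SELECT country, bucket, COUNT(*) AS installs
        FROM rated
        GROUP BY country, bucket
    )
    SELECT country, bucket, installs,
           SUM(installs) OVER (PARTITION BY country ORDER BY bucket)
               AS cumulative_installs
    FROM bucketed
    ORDER BY country, bucket
""").fetchall()
```

Flattening to one row per (device, app) pair is only one possible model. It makes counting and joining straightforward, at the cost of repeating the device's country on every row.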