Student Project 2 - Apps Frequently Installed Together



42matters is a rapidly growing start-up leading the development of next-generation mobile user modeling technology. Our solutions are used by big brand companies in the mobile advertising market to serve mobile users intelligently targeted content. We are an international team with an innovative and fast-paced company culture.

Project Overview

The collected anonymized data about mobile devices needs to be used for different data analytics tasks. The data is stored in an online transaction processing system (referred to as the online system in the following), which is not suitable for this type of task. The goal of the project is to set up a system that allows offline data analytics based on Hadoop/Spark. The whole system will be implemented on Amazon AWS. The main activities of the project are:

- Load data from the online system into Hadoop/Spark.
- Structure and prepare the data so that it is suitable for the required data analytics tasks.
- Implement and run the data analytics tasks.

The system will be built on a stack of MongoDB, Couchbase, and a Hadoop cluster running on the Amazon cloud. More details about these three parts are described in the following sections.

The figure below gives an overview of the systems involved in the project. On the left side is the online system, which stores all production data in two different database systems, Couchbase and MongoDB. This part of the system will be provided. On the right side is the offline system, which needs to be implemented. Data from the online system has to be loaded into the offline system and analyzed there. The whole system will be created in Amazon AWS (user credentials for Amazon AWS will be provided by 42matters).

Data Sources Structure

The source data used in the project is data about mobile devices and about apps. Devices are stored in Couchbase, whereas apps are stored in MongoDB. Both devices and apps are stored in JSON format.

Apps

Apps are stored in a MongoDB collection (a collection in MongoDB corresponds to a table in a relational database). Each app is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- package_name: the unique identifier of an app
- title: the title of an app
- description: the description of an app
- category: the Google Play category the app belongs to
- rating: the rating of the app on Google Play

The following example represents the app document for the Facebook app:

{
    "package_name": "com.facebook.katana",
    "title": "Facebook",
    "description": "Keeping up with friends is faster than ever..",
    "category": "Social",
    "rating": 4.0
}

This collection contains about 1 million apps.

Devices

Devices are stored in Couchbase buckets (a bucket in Couchbase corresponds to a table in a relational database). Each device is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- udid: the unique identifier of a device
- country: the country of the device
- timestamp: timestamp of the last update of the document
- apps: list of apps installed on the device (apps are identified by their package name)

The following example represents a device which, among others, has the Facebook app installed:

{
    "udid": "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
    "country": "CH",
    "timestamp": ISODate("2014-09-24T10:18:56.531Z"),
    "apps": [
        {
            "fit": ISODate("2013-10-12T03:47:39Z"),
            "lut": ISODate("2014-06-22T12:15:19Z"),
            "pn": "playboard.android"
        },
        {
            "fit": ISODate("2010-05-18T08:43:32Z"),
            "lut": ISODate("2014-09-24T10:11:46Z"),
            "pn": "com.facebook.katana"
        }
    ]
}

This bucket contains millions of devices.
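To get a feel for the data, a single app document can be inspected directly in the online MongoDB instance. The following is a minimal Python sketch assuming pymongo is installed; the host, database and collection names are illustrative placeholders, not the real 42matters ones.

# Hypothetical inspection of one app document; connection details are assumed.
from pymongo import MongoClient

client = MongoClient("mongodb://example-host:27017")
apps = client["appdb"]["apps"]

facebook = apps.find_one({"package_name": "com.facebook.katana"})
print(facebook["title"], facebook["category"], facebook["rating"])

Device documents live in Couchbase rather than MongoDB; they would be read through the Couchbase client SDK or, for bulk analytics, through the Hadoop connector mentioned in the Project Tasks section.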

Offline Computations Requirements

The system to be built will enable offline computations on large amounts of data. These computations range from simple aggregations over devices and apps to more complex algorithms. The offline computations have to be implemented in Python using Spark (Spark provides a Python API). The output of each computation has to be stored back into Hadoop (or into the online system, i.e. MongoDB and Couchbase). Some examples of offline computations to be supported are listed below; a PySpark sketch of the first two follows the list.

- Count the number of devices that have a specific app installed (device app frequency, DAF). This has to be computed for all apps:

  Package name          Count (DAF)
  com.facebook.katana   1,850,300

- Count the number of devices that have a given pair of apps installed together (device apps pair frequency, DAPF). The DAPF has to be computed for all pairs of apps appearing on any of the devices:

  Package name #1       Package name #2     Count (DAPF)
  com.facebook.katana   playboard.android   126,000

- Compute a score for every pair of apps based on the pair's DAPF and the DAFs of the two apps in the pair. Note that for any two apps app1 and app2, two pairs have to be considered, i.e. the pair (app1, app2) and the pair (app2, app1). The specific formula for computing the pair score will be provided by 42matters:

  Package name #1       Package name #2     Score
  com.facebook.katana   playboard.android   0.68

- Compute the score defined in the previous point over all devices (global) and for devices in specific countries, i.e. CH, DE, US, IT:

  Country   Package name #1       Package name #2     Score
  GLOBAL    com.facebook.katana   playboard.android   0.68
  CH        com.facebook.katana   playboard.android   0.56
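The sketch below shows one way the DAF and DAPF computations could look in PySpark. It assumes a devices RDD of Python dicts shaped like the device documents shown earlier (an "apps" list whose entries carry a "pn" field); the function names are illustrative. The pair score is left out on purpose, since its formula will be provided by 42matters.

# Minimal DAF/DAPF sketch; `devices` is assumed to be an RDD of device dicts.
from operator import add
from itertools import permutations


def installed_apps(device):
    # Set of package names installed on one device document.
    return {entry["pn"] for entry in device.get("apps", [])}


def compute_daf(devices):
    # Device app frequency: (package_name, number of devices with the app installed).
    return (devices.map(installed_apps)
                   .flatMap(lambda apps: [(pn, 1) for pn in apps])
                   .reduceByKey(add))


def compute_dapf(devices):
    # Device apps pair frequency: ((pn1, pn2), count). Both orderings of a pair
    # are emitted, as required above; a device with n apps therefore produces
    # n * (n - 1) intermediate pairs.
    return (devices.map(installed_apps)
                   .flatMap(lambda apps: [(pair, 1) for pair in permutations(apps, 2)])
                   .reduceByKey(add))

The per-country variant can reuse the same functions on a filtered RDD, e.g. compute_dapf(devices.filter(lambda d: d.get("country") == "CH")). Results can be written back to HDFS with saveAsTextFile or pushed into MongoDB/Couchbase.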

Project Tasks

The project requires several tasks to be accomplished:

Loading data from Couchbase and MongoDB into Hadoop
Hadoop connectors exist for both Couchbase and MongoDB. These connectors can be used to extract the data about devices and apps from the source systems and load it into Hadoop (a minimal loading sketch is given at the end of this document).

Computation implementation
Once the data about devices and apps has been loaded into Hadoop/Spark, the offline computation algorithms described in the previous section need to be implemented. A challenging part of this task is handling the large amount of intermediate data generated by the computations: every possible pair of apps found on a device needs to be considered, which in the worst case is the number of apps squared (in practice it is less). An advantage of using Spark in this project (compared to plain MapReduce) is that intermediate data does not need to be written to disk.

Challenges

- Understanding and using the technology stack
- Mastering the distributed model of Hadoop/Spark
- Implementing the offline algorithms in a performant way
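As a starting point for the loading task above, the sketch below pulls the apps collection from MongoDB into a Spark RDD via the mongo-hadoop connector. It assumes the connector jar is on the Spark classpath and that the online MongoDB instance is reachable; the URI, database name and output path are illustrative placeholders. Loading the device documents from Couchbase would follow the same pattern with the Couchbase Hadoop connector and its own input format and configuration keys.

# Sketch of loading the apps collection into Spark; connection details are assumed.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("load-online-data"))

# Read the apps collection through the mongo-hadoop connector.
apps = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://example-host:27017/appdb.apps"})

# Each record is a (document id, document) pair; keep only the documents and
# persist them to HDFS so later analytics jobs do not touch the online system.
apps.values().saveAsTextFile("hdfs:///data/apps")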