Student Project 2 - Apps Frequently Installed Together

42matters is a rapidly growing start-up leading the development of next-generation mobile user modeling technology. Our solutions are used by big brand companies in the mobile advertising market to serve mobile users intelligently targeted content. We are an international team with an innovative and fast-paced company culture.

Project Overview

The collected anonymized data about mobile devices needs to be used for different data analytics tasks. The data is stored in an online transaction processing system (referred to as the online system in the following) which is not suitable for this type of task. The goal of the project is to set up a system which allows offline data analytics based on Hadoop/Spark. The whole system will be implemented on Amazon AWS.

The main activities of the project are:
- Load data from the online system into Hadoop/Spark.
- Structure and prepare the data so that it is suitable for the required data analytics tasks.
- Implement and run the data analytics tasks.

The system will be built on a stack of MongoDB, Couchbase, and a Hadoop cluster running on the Amazon cloud. More details about the three parts are described in the following sections.

In the figure below an overview of the systems involved in the project is provided. On the left side is the online system, which stores all production data in two different database systems, Couchbase and MongoDB. This part of the system will be provided. On the right side is the offline system, which needs to be implemented. Data from the online system has to be loaded into the created offline system and analyzed there. The whole system will be created in Amazon AWS (user credentials for Amazon AWS will be provided by 42matters).
Data Sources Structure

The source data used in the project is data about mobile devices and about apps. Devices are stored in Couchbase, whereas apps are stored in MongoDB. Both devices and apps are stored in JSON format.

Apps

Apps are stored in a MongoDB collection (a collection in MongoDB corresponds to a table in a relational database). Each app is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:
- package_name: the unique identifier of an app
- title: the title of an app
- description: the description of an app
- category: the Google Play category the app belongs to
- rating: the rating of the app on Google Play

The following example represents the app document for the Facebook app:
{
  "package_name": "com.facebook.katana",
  "title": "Facebook",
  "description": "Keeping up with friends is faster than ever..",
  "category": "Social",
  "rating": 4.0
}

This collection contains about 1 million apps.

Devices

Devices are stored in a Couchbase bucket (a bucket in Couchbase corresponds to a table in a relational database). Each device is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:
- udid: the unique identifier of a device
- country: the country of the device
- timestamp: timestamp of the last update of the document
- apps: list of apps installed on the device (apps are identified by their package name, pn)

The following example represents a device which, among others, has the Facebook app installed:

{
  "udid": "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
  "country": "CH",
  "timestamp": ISODate("2014-09-24T10:18:56.531Z"),
  "apps": [
    {
      "fit": ISODate("2013-10-12T03:47:39Z"),
      "lut": ISODate("2014-06-22T12:15:19Z"),
      "pn": "playboard.android"
    },
    {
      "fit": ISODate("2010-05-18T08:43:32Z"),
      "lut": ISODate("2014-09-24T10:11:46Z"),
      "pn": "com.facebook.katana"
    },
    ...
  ]
}

This bucket contains millions of devices.
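As a quick sanity check on the document shapes above, the two record types can be modeled as plain Python dicts once loaded from the databases (ISODate values become datetime objects in Python). The helper `installed_packages` below is hypothetical, not part of any 42matters API; it simply extracts the installed package names from a device document:

```python
from datetime import datetime, timezone

# Simplified app document (mirrors the MongoDB example above).
app_doc = {
    "package_name": "com.facebook.katana",
    "title": "Facebook",
    "category": "Social",
    "rating": 4.0,
}

# Simplified device document (mirrors the Couchbase example above).
device_doc = {
    "udid": "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
    "country": "CH",
    "apps": [
        {"pn": "playboard.android",
         "lut": datetime(2014, 6, 22, 12, 15, 19, tzinfo=timezone.utc)},
        {"pn": "com.facebook.katana",
         "lut": datetime(2014, 9, 24, 10, 11, 46, tzinfo=timezone.utc)},
    ],
}

def installed_packages(device):
    """Return the package names installed on a device, sorted for stable output."""
    return sorted(entry["pn"] for entry in device["apps"])

print(installed_packages(device_doc))
# ['com.facebook.katana', 'playboard.android']
```

Reducing each device to its set of package names in this way is the natural first preparation step before any of the frequency computations described below.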
Offline Computations Requirements

The system to be built will enable offline computations on large amounts of data. Offline computations range from simple aggregations over devices and apps to more complex algorithms. For the implementation of the offline computations, Python and Spark have to be used (Spark provides a Python API). The output of each computation has to be stored back in Hadoop (or in the online system, MongoDB and Couchbase). Some examples of offline computations to be supported are:

1. Count the number of devices having a specific app installed (device app frequency, DAF). This has to be computed for all apps:

   Package name          Count (DAF)
   com.facebook.katana   1,850,300

2. Count the number of devices having a given pair of apps installed together (device apps pair frequency, DAPF). The DAPF has to be computed for all pairs of apps appearing on any of the devices:

   Package name #1       Package name #2     Count (DAPF)
   com.facebook.katana   playboard.android   126,000

3. Compute a score for all pairs of apps based on their DAPF and the DAFs of the apps of the pair. Note that for any two apps, app1 and app2, two pairs have to be considered, i.e. the pair (app1, app2) and the pair (app2, app1). The specific formula for computing the pair score will be provided by 42matters.

   Package name #1       Package name #2     Score
   com.facebook.katana   playboard.android   0.68

4. Compute the score defined in the previous point over all devices (global) and for devices in specific countries, i.e. CH, DE, US, IT.
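Before implementing these aggregations in Spark, the DAF and DAPF logic can be sketched in plain Python over a toy list of devices. The function names are illustrative, and the pair score itself is omitted since its formula is provided separately by 42matters; note how `permutations` produces both orderings of each pair, as the requirements demand:

```python
from collections import Counter
from itertools import permutations

# Toy input: each device reduced to its country and its set of installed apps.
devices = [
    {"country": "CH", "apps": {"com.facebook.katana", "playboard.android"}},
    {"country": "CH", "apps": {"com.facebook.katana"}},
    {"country": "DE", "apps": {"com.facebook.katana", "playboard.android"}},
]

def device_app_frequency(devices):
    """DAF: for each app, the number of devices on which it is installed."""
    daf = Counter()
    for d in devices:
        daf.update(d["apps"])
    return daf

def device_app_pair_frequency(devices):
    """DAPF: for each ordered pair of apps, the number of devices on which
    both are installed; (app1, app2) and (app2, app1) are counted separately."""
    dapf = Counter()
    for d in devices:
        dapf.update(permutations(sorted(d["apps"]), 2))
    return dapf

daf = device_app_frequency(devices)
dapf = device_app_pair_frequency(devices)
print(daf["com.facebook.katana"])                          # 3
print(dapf[("com.facebook.katana", "playboard.android")])  # 2
```

In Spark the same logic becomes a flatMap emitting one key per app (DAF) or per ordered pair (DAPF) from each device, followed by a reduceByKey that sums the counts.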
   Country   Package name #1       Package name #2     Score
   GLOBAL    com.facebook.katana   playboard.android   0.68
   CH        com.facebook.katana   playboard.android   0.56

Project Tasks

The project requires several tasks to be accomplished:

Loading data from Couchbase and MongoDB into Hadoop

There exist Hadoop connectors for both Couchbase and MongoDB. These connectors can be used to extract the data about devices and apps from the source systems in order to load it into Hadoop.

Computation implementation

After the data about devices and apps has been loaded into Hadoop/Spark, the different offline computation algorithms described in the previous section need to be implemented. A challenging part of this task might be handling the large amount of intermediate data generated by the computations. In fact, every possible pair of apps found on a device needs to be considered, which in the worst case is the number of apps squared (in practice it is less). Note that an advantage of using Spark in this project (compared to using MapReduce) is that intermediate data doesn't need to be stored to disk.

Challenges
- Understanding and using the technology stack
- Mastering the distributed model of Hadoop/Spark
- Implementing the offline algorithms in a performant way
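The per-country score table above implies a grouped aggregation: pair counts are needed once globally and once per country of interest. A minimal sketch, reusing the same toy device records as before and using plain Python as a stand-in for Spark's flatMap/reduceByKey pattern (the function and key names are illustrative assumptions):

```python
from collections import Counter
from itertools import permutations

# Countries for which per-country scores are required, per the task description.
COUNTRIES = {"CH", "DE", "US", "IT"}

devices = [
    {"country": "CH", "apps": {"com.facebook.katana", "playboard.android"}},
    {"country": "CH", "apps": {"com.facebook.katana"}},
    {"country": "DE", "apps": {"com.facebook.katana", "playboard.android"}},
]

def pair_counts_by_scope(devices):
    """Count each ordered app pair once under the "GLOBAL" scope and once
    under the device's country. In Spark this is a flatMap that emits
    (scope, app1, app2) keys, followed by a reduceByKey that sums them."""
    counts = Counter()
    for d in devices:
        scopes = ["GLOBAL"]
        if d["country"] in COUNTRIES:
            scopes.append(d["country"])
        for pair in permutations(sorted(d["apps"]), 2):
            for scope in scopes:
                counts[(scope, *pair)] += 1
    return counts

counts = pair_counts_by_scope(devices)
print(counts[("GLOBAL", "com.facebook.katana", "playboard.android")])  # 2
print(counts[("CH", "com.facebook.katana", "playboard.android")])      # 1
```

Emitting the GLOBAL key alongside the country key in a single pass avoids scanning the device data twice, which matters at the scale of millions of devices.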