Student Project 1 - Explorative Data Analysis with Hadoop and Spark

42matters is a rapidly growing start-up leading the development of next-generation mobile user modeling technology. Our solutions are used by big-brand companies in the mobile advertising market to serve mobile users intelligently targeted content. We are an international team with an innovative and fast-paced company culture.

Project Overview

The anonymized data collected about mobile devices needs to be used for various data analytics tasks. The data is stored in an online transaction processing system (referred to below as the online system), which is not suitable for this type of task. The goal of the project is to set up a system based on Hadoop/Spark that allows offline data analytics. The whole system will be implemented on Amazon AWS.

The main activities of the project are:

- Load data from the online system into Hadoop/Spark.
- Structure and prepare the data so that it is suitable for the required data analytics tasks.
- Implement and run the data analytics tasks.

The system will be built on a stack of MongoDB, Couchbase, and a Hadoop cluster running on the Amazon cloud. More details about the three parts are given in the following sections.

The figure below gives an overview of the systems involved in the project. On the left side is the online system, which stores all production data. Data is stored in two different database systems, Couchbase and MongoDB. This part of the system will be provided. On the right side is the offline system, which needs to be implemented. Data from the online system has to be loaded into the offline system and analyzed there. The whole system will be created in Amazon AWS (user credentials for Amazon AWS will be provided by 42matters).
Data Sources Structure

The source data used in the project describes mobile devices and apps. Devices are stored in Couchbase, whereas apps are stored in MongoDB. Both devices and apps are stored in JSON format.

Apps

Apps are stored in a MongoDB collection (a collection in MongoDB corresponds to a table in a relational database). Each app is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- package_name: The unique identifier of an app.
- title: The title of an app.
- description: The description of an app.
- category: The Google Play category the app belongs to.
- 42category (optional): Similar to the field category, but more fine-grained.
- rating: The rating of the app on Google Play.
The following example represents the app document for the Facebook app:

    {
      package_name : "com.facebook.katana",
      title : "Facebook",
      description : "Keeping up with friends is faster than ever..",
      category : "Social",
      42category : "Social Network",
      rating : 4.0
    }

This collection contains about 1 million apps.

Devices

Devices are stored in Couchbase buckets (a bucket in Couchbase corresponds to a table in a relational database). Each device is represented by a JSON document (which corresponds to a row in a table of a relational database) containing, among others, the following fields:

- udid: The unique identifier of a device.
- country: The country of the device.
- timestamp: Timestamp of the last update of the document.
- apps: List of apps installed on the device (apps are identified by their package name).

The following example represents a device which, among others, has the Facebook app installed:

    {
      udid : "++/OarsCrkiQx5EyY/XTVxOwc4m1H2Re3m+CdiW+YeU=",
      country : "CH",
      timestamp : ISODate("2014-09-24T10:18:56.531Z"),
      apps : [
        {
          fit : ISODate("2013-10-12T03:47:39Z"),
          lut : ISODate("2014-06-22T12:15:19Z"),
          pn : "playboard.android"
        },
        {
          fit : ISODate("2010-05-18T08:43:32Z"),
          lut : ISODate("2014-09-24T10:11:46Z"),
          pn : "com.facebook.katana"
        }
      ]
    }

This bucket contains millions of devices.
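Because each device document carries an apps list of variable length, most analyses need one row per (device, app) pair. The following is a minimal sketch of that flattening step in plain Python, assuming the field names from the device example (udid, country, apps, pn). The function name flatten_device and the sample values are hypothetical, and on the actual cluster the same reshaping would be done with the Hadoop/Spark tooling rather than with this code.

```python
from typing import Dict, Iterator, Tuple


def flatten_device(device: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield one (udid, country, package_name) row per installed app.

    A device without an "apps" list yields no rows.
    """
    for app in device.get("apps", []):
        yield (device["udid"], device["country"], app["pn"])


# Hypothetical sample document, shaped like the device example.
sample_device = {
    "udid": "device-1",
    "country": "CH",
    "apps": [
        {"pn": "playboard.android"},
        {"pn": "com.facebook.katana"},
    ],
}

rows = list(flatten_device(sample_device))
```

Each resulting row has a fixed number of columns, so the rows fit a tabular structure regardless of how many apps a device has installed.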
Data Analytics Requirements

The system to be built will enable explorative data analytics, i.e. it will allow users to explore device and app data by executing SQL-like queries (e.g. by using Hive over Spark or Spark SQL). The following are some examples of queries to be supported by the system:

- Number of devices having a specific app installed.

    Package name           Number of devices
    com.facebook.katana    1,850,300

- Number of existing apps per app category.

    Category    Number of apps
    Social      2,500
    Business    7,430

- Find the top 10 most installed apps in the countries CH, DE, US, IT.

- Compute the cumulative count of app installations on devices based on the app rankings (also grouped by country):

    Country  Ranking Range  App Installations  Cumulative App Installations
    CH       (4.0, 5.0]     15,000,000         15,000,000
    CH       (3.0, 4.0]     10,000,000         25,000,000
    CH       (2.0, 3.0]      2,000,000         27,000,000
    CH       (1.0, 2.0]      1,000,000         28,000,000
    CH       [0.0, 1.0]        500,000         28,500,000
    DE       (4.0, 5.0]     75,000,000         75,000,000
- Percentage of the top 1000 apps (apps with most installations on devices) per country with a 42category. (Note: this query requires first identifying the top 1000 apps per country and then computing the percentage of them having a 42category.)

- Average percentage of apps having a 42category, per device and country.

- Top 10 apps (apps with most installations on devices) per country which do not have any 42category.

Project Tasks

The project requires several tasks to be accomplished:

- Loading data from Couchbase and MongoDB into Hadoop. There are Hadoop connectors that can connect to Couchbase and MongoDB. These connectors can be used to extract the data about devices and apps from the source systems in order to load it into Hadoop.

- Data modeling. Data from the source systems needs to be modeled in Hadoop (e.g. as Hive tables) in a way that allows the above queries to be expressed. A challenging part of this task might be bringing the device documents into a tabular structure: each device document contains a list of the apps installed on that device, and this list can have a different length on each device.

- Query writing. Based on the defined data model, queries need to be written to answer the data analytics requirements described in the previous section.

Challenges

- Understanding and using the technology stack.
- Mastering the distributed model of Hadoop/Spark.
- Mastering the SQL-like query language to accomplish the data analytics tasks.

(Optional) Tableau Software Integration

Tableau Software is a tool for explorative data analysis. It can connect to different data sources and allows the data to be explored graphically. Tableau Software could use the Hadoop/Spark/Hive cluster as a data source, making it possible to explore the data in Hadoop graphically and to create dashboards.
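To make the query-writing task described under Project Tasks more concrete, the following sketch expresses three of the simpler example queries in SQL. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a local stand-in for Hive or Spark SQL, the table names apps and device_apps and all inserted rows are made up, and the device data is assumed to have already been flattened to one row per installed app. The SQL dialect on the actual cluster may differ.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for the Hive/Spark SQL layer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apps (package_name TEXT PRIMARY KEY, category TEXT, rating REAL);
    -- Assumed flattened form of the device documents: one row per installed app.
    CREATE TABLE device_apps (udid TEXT, country TEXT, package_name TEXT);
""")

# Made-up sample rows (the category and rating of the second app are invented).
conn.executemany("INSERT INTO apps VALUES (?, ?, ?)", [
    ("com.facebook.katana", "Social", 4.0),
    ("playboard.android", "Tools", 3.5),
])
conn.executemany("INSERT INTO device_apps VALUES (?, ?, ?)", [
    ("d1", "CH", "com.facebook.katana"),
    ("d1", "CH", "playboard.android"),
    ("d2", "CH", "com.facebook.katana"),
    ("d3", "DE", "com.facebook.katana"),
])


def devices_with_app(package_name):
    """Number of devices having a specific app installed."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT udid) FROM device_apps WHERE package_name = ?",
        (package_name,),
    ).fetchone()
    return row[0]


def apps_per_category():
    """Number of existing apps per app category."""
    return conn.execute(
        "SELECT category, COUNT(*) FROM apps GROUP BY category ORDER BY category"
    ).fetchall()


def top_apps(country, limit=10):
    """Most installed apps in one country, ties broken by package name."""
    return conn.execute(
        """SELECT package_name, COUNT(DISTINCT udid) AS installs
           FROM device_apps
           WHERE country = ?
           GROUP BY package_name
           ORDER BY installs DESC, package_name
           LIMIT ?""",
        (country, limit),
    ).fetchall()
```

Calling top_apps once per country (CH, DE, US, IT) would cover the top-10 example. The queries involving 42category would additionally join device_apps with apps on package_name.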