Advanced Business Analytics using Distributed Computing (Hadoop)


Advanced Business Analytics using Distributed Computing (Hadoop)
MIS Final Project
Submitted By: Mani Kumar Pantangi
M - Management Information Systems
Jon M. Huntsman School of Business
Utah State University

Contents

Understanding Data
Problem Statement
Problem Approach
Hadoop Ecosystem
Reports Description
Data Formatting
Generating Information
Loading Data
Running a Report
Saving a Report
Viewing Results Saved
Archiving Files
Introduction to Hadoop Terminologies
Tools
Introduction to Amazon Web Services (AWS)
Reports Code

Understanding Data

Data Source: Dice.com is a career website serving information technology and engineering professionals. The Dice database manages the details of all its registered professionals and companies. On the professionals' side, it includes data such as a person's name, qualifications, and soft and hard skills. The companies' data is different: it includes the position/title the company is looking for, when the job was posted, the skills required for the position, the posting location, the pay rate, a description of the position, and so on.

Problem Statement

The primary problem is the format of the data, which is not easily readable and does not present the information from a broader perspective: how the jobs are distributed by location, which distinct positions each company is looking for and how many, how the jobs are spread across the United States, which region of the US has the most job postings, which companies are looking for a specific skill, and so on.

Problem Approach

The basic problem with the data is its format. Once the data is converted into a readable format, the rest of the analysis can be done with programming and query languages, which help us synthesize meaningful information from the raw data. In our approach we first convert the data from JSON format to a tabular format. On the resulting data we then use the Hive or Pig query languages to construct the code and derive the information. We will also see how to create an account on Amazon Web Services and run the queries in the cloud.

Hadoop Ecosystem

The following Hadoop tools are used to derive information from the given data:
HIVE
PIG
Amazon Web Services (AWS)

Reports Description

Report 1: Data Formatting
Our first and most basic step is to convert the raw JSON data into a readable format, i.e., to put the data into tabular form. Hive is well suited to working with tabular formats (ORC file format). Using HQL, we design reports that convert the JSON data and load it into ORC tables in HCatalog.
Report 1a: Uploading Dice.com JSON data (data from dice.com).
Report 1b: Uploading usastates JSON data (all US states and their respective abbreviations).
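To make the JSON-to-tabular idea concrete, the following is a minimal Python sketch that flattens one scraped Dice record into a plain row. The record shape and field values follow the Dice Data sample shown later in this report; the sketch only illustrates the conversion that Report 1a performs in HiveQL.

import json

# One scraped Dice record, shaped like the "Dice Data" sample later in this report
# (every field arrives as a single-element list).
record = '''{"Title": ["Data Scientist - Oxyprime LLC - Florham Park, NJ"],
             "Location": ["Florham Park, NJ"]}'''

doc = json.loads(record)

# Flatten each one-element list into a plain column value -- the same idea
# Report 1a implements with get_json_object/regexp_extract in HiveQL.
row = {field: values[0] for field, values in doc.items()}
print(row["Title"], "|", row["Location"])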

Generating Information

Report 2: We now have the data in a readable format, but we have not yet derived any information from it. Let's write a report to understand how jobs are posted within the USA, i.e., to identify what percentage of jobs is posted in each US state. This task requires the following operations on the data, in Pig:
1. Clean the data to remove records with an empty Posted column.
2. Since we are interested only in location and number of jobs, include only the Dice ID and Location fields in the output, using the FOREACH operator.
3. The Location column does not give the state directly (about 80% of the data is in the form "Logan, UT"), so the field has to be split to give only the state. Use a regular expression to break the field and keep only the text after the comma, with the REGEX_EXTRACT operator.
4. We now have all area names separated from the location, but these include areas from all over the world. Since our aim is only US states, join the output with the usastates table using the JOIN operator. The result is all US states and their respective Dice IDs.
5. With states and Dice IDs in hand, group the Dice IDs by state using the GROUP BY operator.
6. With the Dice IDs grouped, count them using the COUNT operator. This gives the number of jobs posted in each state.
7. However, we want the percentage of jobs posted, not the raw count. Group all records using the GROUP ALL operator and calculate the total number of jobs posted across all US states with the SUM function.
8. From the total obtained above, it is easy to calculate the percentage of jobs posted in each state. Make sure to cast to FLOAT when performing this division (a small Python sketch of this counting-and-percentage logic follows below).
9. Sort the output in descending order with the ORDER BY operator, so the state with the most postings appears first.
10. Display the output using the DUMP operator.
Visualization: Export the output to Tableau and plot it on a map of the United States. [Figure: Tableau map of job-posting percentages by US state.]
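A minimal Python sketch of the grouping, counting, and percentage steps above (the state counts here are dummy values for illustration only; the actual Pig implementation appears in the Reports Code section):

# (state, dice_id) pairs as they would look after the JOIN in step 4 -- dummy values only.
state_jobs = [("California", "d1"), ("California", "d2"), ("Utah", "d3"), ("New York", "d4")]

# Steps 5-6: group by state and count Dice IDs.
counts = {}
for state, dice_id in state_jobs:
    counts[state] = counts.get(state, 0) + 1

# Steps 7-8: total across all states, then percentage per state (note the float division).
total = sum(counts.values())
percentages = {state: (count / float(total)) * 100 for state, count in counts.items()}

# Step 9: sort descending by percentage.
for state, pct in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
    print(state, round(pct, 1))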

Report 3: From the output above it is clear that California has the largest number of jobs posted on Dice.com. However, looking at the postings region by region, the eastern side appears much denser than the west. So let's perform a few more operations and see how the jobs are dispersed. This time we make use of a user-defined function (UDF), written in Python, to split the data into 5 regions.
1. Continuing from the previous script, after generating the total job count in step 7, apply a UDF (usaregions.py) with the FOREACH operator to generate the 5 regions.
2. From the regions above and the total jobs from the previous steps, it is easy to calculate the percentage of jobs posted in each region. Make sure to cast to float when performing this operation.
3. Sort and run the report. The output should display the 5 regions and their respective percentages of jobs posted.
The output shows that jobs are equally spread across the Midwest Region (24%) and the Southeast Region (24%).

Report 4: So far we have looked only at US job postings. Now we are interested in seeing job postings for both US and non-US places. This report makes use of a UDF and the SPLIT operator (a sketch of such a UDF follows this report).
1. This script continues from Report 2 after step 3, where areas are generated from the Location field.
2. To make sure there are no empty area fields (stale data), remove them using the FILTER operator.
3. Group each area with its Dice IDs; this creates no duplicate records.
4. Using the COUNT operator, count the Dice IDs for every area.
5. Use the SPLIT operator to partition the data into US_JobPosts and NonUS_JobPosts. For this we use the UDF (areas.py): when an area name is passed to it, it checks whether the area is in the US. If so, the record goes to US_JobPosts; otherwise it goes to NonUS_JobPosts.
6. Run the query and view the results.
7. The output should show two groups, with all the US states in US_JobPosts and the non-US postings in NonUS_JobPosts.
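Neither areas.py nor usaregions.py is reproduced in this report, so the following is only a guess at their shape: a minimal Jython UDF sketch exposing a checkifexists function, matching the way the Pig scripts later call regionudf.checkifexists(group) (areas.py would be analogous but return the area itself when it is a US state). The state-to-region mapping shown is an illustrative assumption, not the author's actual table.

# usaregions.py (sketch) -- not the author's actual UDF, only an illustration of its likely shape.
# When REGISTERed through Pig's Jython script engine, Pig supplies an outputSchema decorator
# that declares the return type; the fallback below keeps the file runnable as plain Python too.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Illustrative mapping only; the real UDF would cover all 50 states across the 5 regions.
REGIONS = {
    'CA': 'West Region', 'WA': 'West Region',
    'UT': 'Southwest Region', 'TX': 'Southwest Region',
    'IL': 'Midwest Region', 'OH': 'Midwest Region',
    'NY': 'Northeast Region', 'MA': 'Northeast Region',
    'FL': 'Southeast Region', 'GA': 'Southeast Region',
}

@outputSchema('region:chararray')
def checkifexists(area):
    # Map a state abbreviation to its region; anything unknown is treated as non-US,
    # which matches the "group != 'NotUSA'" filter in the Report 4 code.
    if area is None:
        return 'NotUSA'
    return REGIONS.get(area.strip(), 'NotUSA')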

Report 5: This report lets us see the companies and their respective numbers of job postings on Dice.com.
1. The Dice data has no column for the company. However, if we look closely at the Title field, its second part is the company name (about 80% accurate).
2. Using the regular-expression function REGEX_EXTRACT together with the SUBSTRING and LAST_INDEX_OF string functions, we can extract the company name from the Title field.
3. Generate a table with Company Name and Dice ID columns using the FOREACH operator.
4. Group the companies with their Dice IDs using the GROUP BY operator.
5. Count all the Dice IDs and generate a table displaying all the companies and their respective numbers of jobs posted.
This report illustrates the use of basic string operators and, from a business perspective, gives an idea of how many jobs each company posted on Dice (a plain-Python sketch of this Title-field parsing follows after Report 6).

Report 6: The previous report gives only the number of jobs each company posted on Dice. If we are instead interested in all the distinct positions a company offers, and in how many there are, we take a different approach. The following report lists all companies (registered in a UDF), their respective job positions, and the total number of distinct positions.
1. Similar to the company name, the position is extracted from the Title field: it is the first part of the Title. Using the SUBSTRING and INDEXOF functions, the position is extracted from the Title field.
2. Generate a table of all companies and their respective positions.
3. Identify the distinct job positions using the DISTINCT operator (a company may post the same position in two different places).
4. Group the companies with respect to each job posting, using the GROUP BY operator.
5. Count all the distinct positions each company is looking for, using the COUNT operator.
6. Now consider only the set of companies (~1700) registered in the UDF, and run the UDF against the table generated in the previous step using the FOREACH operator.
7. The result contains the registered companies (~1700), their respective positions, and the companies not registered in the UDF; the unregistered companies are filtered out with the FILTER operator.
8. Run the report and display the output.
9. The output displays each company name, its number of distinct job positions, and the list of those positions. It shows that Modis is the company looking for the most distinct positions (636), followed by Kforce Inc. with 596 positions across the globe.
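A minimal Python sketch of the Title-field parsing used in Reports 5 through 7, assuming the "Position - Company - Location ... dice.com" format noted above. The title value is taken from the Dice Data sample later in this report; the actual Pig expressions (SUBSTRING, INDEXOF, REGEX_EXTRACT, LAST_INDEX_OF) appear in the Reports Code section.

# Title format assumed for about 80% of records: "Position - Company - Location ... dice.com"
title = "Data Scientist - Oxyprime LLC - Florham Park, NJ dice.com"

# Position: everything before the first " - " (Report 6, SUBSTRING + INDEXOF).
position = title[:title.index(" - ")]

# Company: the text between the first " - " and the next " - "
# (Report 5, REGEX_EXTRACT + SUBSTRING + LAST_INDEX_OF).
rest = title[title.index(" - ") + 3:]
company = rest[:rest.index(" - ")]

print(position)   # -> Data Scientist
print(company)    # -> Oxyprime LLC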

Report 7: This report is fairly basic and is mainly designed to get acquainted with Amazon Web Services (a detailed description of working with AWS appears in a later section). The report itself looks at each company and its respective skill requirements across all the jobs posted on Dice.
1. Because this runs on AWS, loading the data takes a slightly different approach: a sample of the large data set is gathered and then loaded onto AWS via MobaXterm.
2. We are mainly interested in companies and their skill requirements. Since there is no Company field as such in the database, we use string operations to extract the company name from the Title field, using the SUBSTRING, REGEX_EXTRACT, and LAST_INDEX_OF functions, and generate the records with the FOREACH operator.
3. We now have all the companies and their respective skills for particular job postings, but we want distinct companies and their skills. Hence, group all the companies using the GROUP BY operator.
4. The GROUP BY operator generates a bag of values (skills) along with the key (company name). To make the result easier to read, use the FLATTEN operator to pull the skills out of the bag and display them as normal fields.
5. The output shows each distinct company and its respective skill requirements, for example:
BRiCK House Specialty Resources - HTML 5; CSS; Javascript and JQuery

Loading Data

Loading data into Pig: The HCatLoader interface is used within Pig scripts to read data from HCatalog-managed tables. The first interaction with HCatLoader happens while Pig is parsing the query and generating the logical plan: HCatLoader.getSchema is called, which causes HCatalog to query the Hive metastore for the table schema, and that schema is used for all records in the table. Use the LOAD command to load data into an alias: specify the table name in single quotes and load it using the HCatLoader() interface.
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

Running a Report

Hive reports: If the data to be loaded into HCatalog is partitioned, run the following command:
hive -hiveconf FILEDATE=' ' -f /root/project_mpantangi/orcfile.hql
FILEDATE is the data partition, and /root/project_mpantangi/orcfile.hql is the path of the HQL script file. If there are no partitions, use the following:
hive -f /root/project_mpantangi/report1.hql
Pig reports: Pig does not automatically pick up the HCatalog jars. To bring in the necessary jars, add a flag to the pig command on the command line, as below:
pig -useHCatalog -f report5.pig
The -useHCatalog flag ensures that all the jar files required for executing the report are available.

Saving a Report

Pig reports: Use the STORE function to store output. The HCatStorer() interface can be used if you would like to write data to HCatalog-managed tables.

STORE dicetop20 INTO 'report5_output' USING org.apache.hcatalog.pig.HCatStorer();
Note: if no interface is specified, the output is saved as a text file.
STORE dicetop20 INTO 'report5_output';

Viewing Results Saved

The output files are saved on HDFS. To access them, run the following sequence of commands on the command line:
1. View all the files saved on HDFS:
hdfs dfs -ls
2. Once you see the folder you are interested in, enter the following command:
hdfs dfs -ls reportx_output
3. The folder shows a list of files in it; the output is normally saved in the part-r-00000 file:
hdfs dfs -cat reportx_output/part-r-00000
4. View the contents of the file with the command below and, if required, save it to a file on your local machine:
hdfs dfs -cat report2_output/part-r-00000 > report2_output.csv

Archiving Files

In UNIX, the name of the tar command is short for tape archiving. A common use of tar is simply to combine a few files into a single file for easy storage and distribution. To combine multiple files and/or directories into a single file, use the following command:
tar cvfp mpantangi.tar project_mpantangi
cvfp - creates a tar file (c), verbosely (v), writing to the named file (f), preserving permissions (p).
mpantangi.tar - the name of the .tar file.
project_mpantangi - the directory being archived.
Similarly, to extract files from an archive, use the following command:
tar xvfp mpantangi.tar
xvfp - extracts files from the archive.
mpantangi.tar - the archived file.
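The same archive can also be created and unpacked programmatically; a minimal sketch using Python's standard tarfile module, reusing the file and directory names from the example above:

import tarfile

# Equivalent to: tar cvfp mpantangi.tar project_mpantangi
with tarfile.open("mpantangi.tar", "w") as archive:
    archive.add("project_mpantangi")

# Equivalent to: tar xvfp mpantangi.tar
with tarfile.open("mpantangi.tar", "r") as archive:
    archive.extractall()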

Introduction to Hadoop Terminologies

HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. HDFS is designed to be fault-tolerant through replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes. HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. The NameNode is responsible for storing and managing the metadata, so that when MapReduce or another execution framework calls for the data, the NameNode tells it where the needed data resides.

MapReduce: MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. An example of the MapReduce process is sketched below.
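The original figure for this example is not reproduced here; as a stand-in, the following plain-Python sketch walks a small word-count example through the map, sort/shuffle, and reduce phases described above (it only illustrates the model, not Hadoop's actual execution):

from itertools import groupby

lines = ["big data on hadoop", "hadoop stores big data"]

# Map phase: each input line is turned into (word, 1) pairs, independently of the others.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: the framework sorts map output by key so equal keys are adjacent.
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: each key's values are summed to give the word count.
for word, pairs in groupby(mapped, key=lambda pair: pair[0]):
    print(word, sum(count for _, count in pairs))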

HCatalog: HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce, and Hive) to read and write data on the grid more easily. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats.

Pig: Pig is a high-level scripting language used with Apache Hadoop. Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin. Pig translates Pig Latin scripts into MapReduce so that they can be executed within Hadoop. Pig Latin is a data-flow language, whereas SQL is a declarative language: SQL is great for asking a question of your data, while Pig Latin allows you to write a data flow that describes how your data will be transformed. Pig Latin can be extended with user-defined functions written in Java, Python, Ruby, or other scripting languages.

HIVE: The tables in Hive are similar to tables in a relational database. Databases are made up of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data. Hive's SQL dialect, called HiveQL, does not support the full SQL-92 specification. Furthermore, Hive has some extensions that are not in SQL-92, inspired by syntax from other database systems, notably MySQL; to a first-order approximation, HiveQL most closely resembles MySQL's SQL dialect. Data analysts use Hive to explore, structure, and analyze data, then turn it into business insight.

ORC File Format: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

JSON File Format: The JavaScript Object Notation (JSON) file format serves a purpose similar to XML. It is a lightweight, "self-describing" data interchange format that is easy to understand and language independent; the text can be read and used as a data format by any programming language.

UDF: As discussed earlier, Pig can be extended with user-defined functions. Pig UDFs can currently be implemented in six languages: Java, Jython, Python, JavaScript, Ruby, and Groovy. A UDF is a function that runs every time a Pig query that references it is executed. Pig also provides support for Piggy Bank, a repository of Java UDFs. Note: before any UDF is used, it must be registered.

Registering a UDF: Register a Jython script as shown below before using the UDF in your Pig scripts. Currently, Pig identifies jython as a keyword and ships the required script engine (Jython) to interpret it.
REGISTER 'companypass.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS companyudf;

Dice Data: Sample Dice data format and the list of fields available for consideration:
{ "all": [],
"Description": ["<div id=\"detaildescription\"><p>the opportunity is with predictive analytics team. The primary responsibility of this position is to provide technical assistance by performing data migration, data analytics, economic modelling, simulation, production analytics and reporting. Additional responsibilities include data cleansing, data analysis, project planning and management, developing custom workflows, and presenting and discussing discoveries. </p></div>"],
"Title": ["Data Scientist - Oxyprime LLC - Florham Park, NJ dice.com "],
"Skills": ["SQL, MS Access and Excel, Statistics, SSIS, SAP Business objects, Basic programming skills (C++, Java, Visual Basic,.net development), Analytical skills\u00a0"],
"Pay_Rate": ["DOE\u00a0"],
"Area_Code": ["973\u00a0"],
"Telecommute": ["no\u00a0"],
"Position_ID": ["902IDS\u00a0"],
"Length": ["12+ months\u00a0"],
"Link": "
"Location": ["Florham Park, NJ"],
"Dice_ID": [" \u00a0"],
"Tax_Term": ["CON_CORP\u00a0"],
"Posted": [" \u00a0"],
"Travel_Req": ["none\u00a0"] }

Tools

Oracle VM VirtualBox: VirtualBox is a cross-platform virtualization application. It installs on your existing Intel- or AMD-based computers, whether they are running Windows, Mac, Linux, or Solaris operating systems, and it extends the capabilities of your existing computer so that it can run multiple operating systems (inside multiple virtual machines) at the same time. For example, you can run Windows and Linux on your Mac, run Windows Server 2008 on your Linux server, run Linux on your Windows PC, and so on, all alongside your existing applications. You can install and run as many virtual machines as you like.

Hortonworks Sandbox: A single-node Hadoop cluster, running in a virtual machine, that implements the Hortonworks Data Platform (HDP). It is packaged as a virtual machine to make evaluation of, and experimentation with, HDP fast and easy.

MobaXterm: MobaXterm is an enhanced terminal for Windows. It brings the essential UNIX commands to the Windows desktop in a single portable exe file that works out of the box.

Tableau: Tableau Desktop is a data analysis and visualization tool.

Amazon Web Services (AWS): AWS is Amazon's cloud computing platform.

Introduction to Amazon Web Services (AWS)

In recent years, virtualization has become a widely accepted way to reduce operating costs and increase the reliability of enterprise IT. In addition, grid computing has made possible a completely new class of analytics, data-crunching, and business-intelligence tasks that were previously cost- and time-prohibitive. Along with these technology changes, the speed of innovation and the unprecedented acceleration in the introduction of new products have fundamentally changed the way markets work. Together with the wide acceptance of software-as-a-service (SaaS) offerings, these changes have paved the way for the latest IT infrastructure challenge: cloud computing.

Amazon Web Services (AWS) is a collection of remote computing services, also called web services, that together make up a cloud computing platform offered by Amazon.com. The main advantage of AWS for a large organization is that investment in hardware and software components can be reduced; AWS is highly scalable, so resources can be ramped up when needed and scaled down when the work fits the available resources. For our queries we need a virtual machine, storage space, and Hadoop; the following three AWS components serve that purpose:
EC2 - Virtual machine - Elastic Compute Cloud
Amazon S3 - Storage space - Simple Storage Service
Amazon EMR - Hadoop on AWS - Elastic MapReduce

Creating an AWS Cluster:
1. Create an account on AWS (to create an account you need to provide valid credit card information).
2. From the AWS services, select EC2 and create a new key pair.
3. Download and save the generated PEM file.
4. From the AWS services, select IAM to create user accounts.
5. Create the user accounts and save the user credentials.

6. From the AWS services, select EMR and create a cluster.
7. While setup is in progress, the cluster status transitions as follows:
STARTING - The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING - Bootstrap actions are being executed on the cluster.
RUNNING - A step for the cluster is currently being run.
WAITING - The cluster is currently active but has no steps to run.
8. In the waiting phase, start MobaXterm and connect to Amazon Web Services.
9. On the SSH connection page, give the following details:
Remote Host: ec us-west-2.compute.amazonaws.com (the Master Public DNS generated during step 7)
Specify Username: hadoop
10. In the Advanced SSH Settings, select Use Private Key and browse to the PEM file generated in step 3.
11. MobaXterm now communicates between you and AWS.
12. Perform all data manipulation operations through MobaXterm.
13. As soon as the work is done, terminate the session on AWS.
14. Go back to the AWS services and select EMR. On the EMR page, click the Terminate button to terminate the cluster. A message is displayed on MobaXterm once the cluster is terminated.
15. The following messages may be displayed on the EMR screen during/after termination:
TERMINATING - The cluster is in the process of shutting down.
TERMINATED - The cluster was shut down without error.
TERMINATED_WITH_ERRORS - The cluster was shut down with errors.
(A scripted alternative to this console workflow is sketched below.)
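The create/monitor/terminate cycle above can also be scripted instead of clicked through in the console. The following is a minimal, illustrative boto3 sketch; it is not part of the original project, and the cluster name, release label, instance types, and key name are assumptions to be replaced with your own values.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Create a small cluster with Hive and Pig installed (assumed parameters; adjust to your account).
response = emr.run_job_flow(
    Name="project-mpantangi",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",          # the key pair created in step 2
        "KeepJobFlowAliveWhenNoSteps": True,  # stay in the WAITING state for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# Poll the status (STARTING -> BOOTSTRAPPING -> RUNNING/WAITING, as described above).
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print(cluster_id, state)

# Terminate when finished (step 14).
emr.terminate_job_flows(JobFlowIds=[cluster_id])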

Reports Code

Report 1: This report imports the raw JSON data into Hive tables using HQL.

Report 1a:
-- hive -hiveconf FILEDATE=' ' -f /root/project_mpantangi/orcfile.hql
-- Read file in JSON format
-- Save file in ORC format
-- Example: FILEDATE=' '

DROP TABLE json_dice;
CREATE TABLE json_dice (jrecord string);
-- default file is a comma-delimited text file

DROP TABLE orc_dice;
CREATE TABLE IF NOT EXISTS orc_dice (
  posted string,
  title string,
  skills string,
  areacode string,
  location string,
  payrate string,
  positionid string,
  diceid string,
  length string,
  taxterm string,
  travelreq string,
  telecommute string,
  link string,
  description string
)
PARTITIONED BY (scrapedate string)
STORED AS ORC;

LOAD DATA LOCAL INPATH "/root/project_mpantangi/dice.${hiveconf:FILEDATE}.json" INTO TABLE json_dice;

ALTER TABLE orc_dice DROP IF EXISTS PARTITION(scrapedate = "${hiveconf:FILEDATE}");

INSERT OVERWRITE TABLE orc_dice PARTITION (scrapedate = "${hiveconf:FILEDATE}")
SELECT
  regexp_extract(get_json_object(json_dice.jrecord, '$.Posted'), '\\["(.*)\\u00a0"\\]', 1) as posted,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Title'), '\\["(.*)"\\]', 1) as title,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Skills'), '\\["(.*)\\u00a0"\\]', 1) as skills,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Area_Code'), '\\["(.*)\\u00a0"\\]', 1) as areacode,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Location'), '\\["(.*)"\\]', 1) as location,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Pay_Rate'), '\\["(.*)\\u00a0"\\]', 1) as payrate,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Position_ID'), '\\["(.*)\\u00a0"\\]', 1) as positionid,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Dice_ID'), '\\["(.*)\\u00a0"\\]', 1) as diceid,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Length'), '\\["(.*)\\u00a0"\\]', 1) as length,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Tax_Term'), '\\["(.*)\\u00a0"\\]', 1) as taxterm,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Travel_Req'), '\\["(.*)\\u00a0"\\]', 1) as travelreq,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Telecommute'), '\\["(.*)\\u00a0"\\]', 1) as telecommute,
  get_json_object(json_dice.jrecord, '$.Link') as link,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Description'), '\\["(.{200}).*"\\]', 1) as description
FROM json_dice;
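A quick way to sanity-check the regexp_extract pattern above outside Hive is to try the same expression in Python's re module on the Position_ID value from the Dice Data sample; for this particular pattern, Java and Python regular expressions behave the same way.

import re

# Field value as it arrives in the scraped JSON: a one-element list whose string ends
# with a non-breaking space (\u00a0) before the closing quote.
raw = '["902IDS\u00a0"]'

# The Hive literal '\\["(.*)\\u00a0"\\]' reaches the regex engine as \["(.*)\u00a0"\].
pattern = r'\["(.*)\u00a0"\]'

print(re.search(pattern, raw).group(1))   # -> 902IDS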

Report 1b:
-- hive -f /root/project_mpantangi/report1.hql
-- Read file in JSON format
-- Save file in ORC format

DROP TABLE json_usa_states;
CREATE TABLE json_usa_states (jrecord string);
-- default file is a comma-delimited text file

DROP TABLE usa_states;
CREATE TABLE usa_states (
  States string,
  Abbreviation string
)
STORED AS ORC;

LOAD DATA LOCAL INPATH "/root/project_mpantangi/usa_states.json" INTO TABLE json_usa_states;

INSERT OVERWRITE TABLE usa_states
SELECT
  get_json_object(json_usa_states.jrecord, '$.States') as States,
  get_json_object(json_usa_states.jrecord, '$.Abbreviation') as Abbreviation
FROM json_usa_states;
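usa_states.json itself is not reproduced in this report; judging from the get_json_object calls above, each record presumably has the following shape (an assumption, shown here with a quick Python check):

import json

# Assumed shape of one usa_states.json record, inferred from '$.States' and '$.Abbreviation' above.
record = '{"States": "Utah", "Abbreviation": "UT"}'
row = json.loads(record)
print(row["States"], row["Abbreviation"])   # -> Utah UT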

Report 2: This report focuses on filtering US states from the Dice data and reporting each state's percentage of job postings relative to the US overall.
-- pig -useHCatalog -f report2.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Load United States of America states data
usastates = LOAD 'default.usa_states' USING org.apache.hcatalog.pig.HCatLoader();

-- FILTER out the first line of the file to avoid errors with Pig 0.12
dice_filtered = FILTER dice BY ((posted IS NOT NULL) AND (posted != ''));

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice_filtered GENERATE location, diceid;

-- From the Location field, extract state names and make sure no stale data (special characters, single characters, etc.) is kept.
-- The regular expression keeps the part of the location field after the comma (e.g. India from [Mumbai, India], or UT from [Logan, UT]).
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Group areas with job postings
area_jobs = GROUP dice_area BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(dice_area.diceid) AS jobscount;

-- Filter out jobs posted outside the United States of America by joining on the states table
usa_jobposts = JOIN jobs_count BY group, usastates BY abbreviation;

-- Keep state name and jobs count
state_jobs = FOREACH usa_jobposts GENERATE states, jobscount;

-- Group all
sum_jobs = GROUP state_jobs ALL;

-- Generate the total number of US postings
total_jobs = FOREACH sum_jobs GENERATE SUM(state_jobs.jobscount) AS totaljobs;

-- Percentage per state
jobs_percent = FOREACH state_jobs GENERATE states, (jobscount/(float)total_jobs.totaljobs)*100 AS jobpercent;

-- Sort so that the states with the most postings come first
job_topstates = ORDER jobs_percent BY jobpercent DESC;

-- Display top 5 states
--top5 = LIMIT job_topstates 5;
--DUMP top5;

-- Display all the states in the US and their respective number of postings
--DUMP job_topstates;

-- Save the output to "Report2_output"
STORE job_topstates INTO 'Report2_output';

Report 3: This report divides the data set into US and non-US job postings and displays the number of jobs posted in each area.
-- pig -useHCatalog -f report3.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice GENERATE location, diceid;

-- From the Location field, extract state/area names and make sure no stale data (special characters, single characters, etc.) is kept.
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Filter records with no location/area
useable_data = FILTER dice_area BY (area != '') AND (area IS NOT NULL);

-- Group areas with job postings
area_jobs = GROUP useable_data BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(useable_data.diceid) AS jobscount;

-- Register the UDF
REGISTER 'areas.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS areaudf;

-- Segregate the job postings into US and non-US
SPLIT jobs_count INTO US_JobPosts IF group == areaudf.checkifexists(group), NonUS_jobPosts IF group != areaudf.checkifexists(group);

-- Display the total number of US postings and non-US postings
--DUMP US_JobPosts;
--DUMP NonUS_jobPosts;

-- Save the outputs into 'report3a_output' and 'report3b_output'
STORE US_JobPosts INTO 'report3a_output';
STORE NonUS_jobPosts INTO 'report3b_output';

Report 4: This report shows the percentage of jobs spread across each of the 5 regions of the US.
-- pig -useHCatalog -f report4.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice GENERATE location, diceid;

-- From the Location field, extract state names and make sure no stale data (special characters, single characters, etc.) is kept.
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Group areas with job postings
area_jobs = GROUP dice_area BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(dice_area.diceid) AS jobscount;

-- Group all areas
groupall = GROUP jobs_count ALL;

-- Calculate the total jobs count
totaljobs = FOREACH groupall GENERATE SUM(jobs_count.jobscount) AS total;

-- Register the UDF
REGISTER 'usaregions.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS regionudf;

-- Map each area to a region
usa_regions = FOREACH jobs_count GENERATE regionudf.checkifexists(group) AS region, jobscount;

-- Group by region
group_regions = GROUP usa_regions BY region;

-- Percentage of jobs in each region
region_jobs = FOREACH group_regions GENERATE group, (SUM(usa_regions.jobscount)/(float) totaljobs.total)*100 AS percent;

-- Filter out non-US records
filter_regions = FILTER region_jobs BY group != 'NotUSA';

-- Print output
--DUMP filter_regions;

-- Save output into report4_output
STORE filter_regions INTO 'report4_output';

Report 5: This report enables the user to view the list of all companies and their respective numbers of job postings.
-- pig -useHCatalog -f report5.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- From the Title field, extract company names.
-- This expression works only when the Title field is written in the format "Position - Company - Location Dice posting".
company_dice = FOREACH dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, diceid;

-- Group all the data by company.
company_group = GROUP company_dice BY company;

-- Count how many jobs each company has posted
jobs_count = FOREACH company_group GENERATE group, COUNT(company_dice.diceid) AS jobscount;

-- Sort so the companies with the most postings come first
orderjobs = ORDER jobs_count BY jobscount DESC;

-- Display the top 20 companies
top20 = LIMIT orderjobs 20;
-- DUMP top20;

-- Save the output to "report5_output"
STORE orderjobs INTO 'report5_output';

Report 6: This report lists each company, its distinct job positions, and its total number of distinct job positions.
-- pig -useHCatalog -f report6.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Extract the position (the first part of the Title field) along with the title
position_dice = FOREACH dice GENERATE SUBSTRING(title, 0, (INDEXOF(title, '-', 1)-1)) AS position, title;

-- Extract the company (from the Title field) along with the position
company_dice = FOREACH position_dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, position;

-- Keep only distinct (company, position) pairs.
distinct_jobs = DISTINCT company_dice;

-- Register the UDF
REGISTER 'companypass.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS companyudf;

-- Verify whether each company exists in the UDF; companies not listed in the UDF come back as "NoCompanyDetected".
valid_company = FOREACH distinct_jobs GENERATE companyudf.checkifexists(company) AS company, position;

-- Group the positions each company is looking for.
group_locations = GROUP valid_company BY company;

-- Display the group results in tabular format
company = FOREACH group_locations GENERATE group, COUNT(valid_company.position) AS count, valid_company.position;

-- Ignore all the records with company name "NoCompanyDetected"
company_position = FILTER company BY (group != 'NoCompanyDetected') AND (group != '');

-- Sort so the company with the most positions is on top
orderpositions = ORDER company_position BY count DESC;

-- Display the top 20 records
display20 = LIMIT orderpositions 20;

-- Display output
-- DUMP display20;

-- Save the output into 'report6_output'
STORE display20 INTO 'report6_output';

Report 7: This report lists all the distinct companies and their respective skill requirements for the jobs posted on dice.com.
-- pig -useHCatalog -f report7.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- List company and skills
company_dice = FOREACH dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, skills;

-- Group by company.
company_group = GROUP company_dice BY company;

-- Flatten the bag of skills
company_skills = FOREACH company_group GENERATE group, FLATTEN(company_dice.skills) AS fskills;

--DUMP company_skills;

-- Save the report into report7_output.
STORE company_skills INTO 'report7_output';



Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved. Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

AWS Schema Conversion Tool. User Guide Version 1.0

AWS Schema Conversion Tool. User Guide Version 1.0 AWS Schema Conversion Tool User Guide AWS Schema Conversion Tool: User Guide Copyright 2016 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may

More information

Sisense. Product Highlights. www.sisense.com

Sisense. Product Highlights. www.sisense.com Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

File S1: Supplementary Information of CloudDOE

File S1: Supplementary Information of CloudDOE File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

Apache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com

Apache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Apache Sentry Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture

More information

Using EMC Documentum with Adobe LiveCycle ES

Using EMC Documentum with Adobe LiveCycle ES Technical Guide Using EMC Documentum with Adobe LiveCycle ES Table of contents 1 Deployment 3 Managing LiveCycle ES development assets in Documentum 5 Developing LiveCycle applications with contents in

More information

Evaluation Checklist Data Warehouse Automation

Evaluation Checklist Data Warehouse Automation Evaluation Checklist Data Warehouse Automation March 2016 General Principles Requirement Question Ajilius Response Primary Deliverable Is the primary deliverable of the project a data warehouse, or is

More information

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer Automated Data Ingestion Bernhard Disselhoff Enterprise Sales Engineer Agenda Pentaho Overview Templated dynamic ETL workflows Pentaho Data Integration (PDI) Use Cases Pentaho Overview Overview What we

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Introduction to Apache Pig Indexing and Search

Introduction to Apache Pig Indexing and Search Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational

More information

User Guide. Analytics Desktop Document Number: 09619414

User Guide. Analytics Desktop Document Number: 09619414 User Guide Analytics Desktop Document Number: 09619414 CONTENTS Guide Overview Description of this guide... ix What s new in this guide...x 1. Getting Started with Analytics Desktop Introduction... 1

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black

Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black Forensic Clusters: Advanced Processing with Open Source Software Jon Stewart Geoff Black Who We Are Mac Lightbox Guidance alum Mr. EnScript C++ & Java Developer Fortune 100 Financial NCIS (DDK/ManTech)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

The Inside Scoop on Hadoop

The Inside Scoop on Hadoop The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop

More information

Course Scheduling Support System

Course Scheduling Support System Course Scheduling Support System Roy Levow, Jawad Khan, and Sam Hsu Department of Computer Science and Engineering, Florida Atlantic University Boca Raton, FL 33431 {levow, jkhan, samh}@fau.edu Abstract

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Data Warehouse Center Administration Guide

Data Warehouse Center Administration Guide IBM DB2 Universal Database Data Warehouse Center Administration Guide Version 8 SC27-1123-00 IBM DB2 Universal Database Data Warehouse Center Administration Guide Version 8 SC27-1123-00 Before using this

More information

Getting Started with Amazon EC2 Management in Eclipse

Getting Started with Amazon EC2 Management in Eclipse Getting Started with Amazon EC2 Management in Eclipse Table of Contents Introduction... 4 Installation... 4 Prerequisites... 4 Installing the AWS Toolkit for Eclipse... 4 Retrieving your AWS Credentials...

More information

Getting Started with Hadoop with Amazon s Elastic MapReduce

Getting Started with Hadoop with Amazon s Elastic MapReduce Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson

More information

Data Semantics Aware Cloud for High Performance Analytics

Data Semantics Aware Cloud for High Performance Analytics Data Semantics Aware Cloud for High Performance Analytics Microsoft Future Cloud Workshop 2011 June 2nd 2011, Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS) Acknowledgement

More information