Advanced Business Analytics using Distributed Computing (Hadoop)


Advanced Business Analytics using Distributed Computing (Hadoop)
MIS Final Project
Submitted By: Mani Kumar Pantangi
M - Management Information Systems
Jon M. Huntsman School of Business
Utah State University

Contents

Understanding Data
Problem Statement
Problem Approach
Hadoop Ecosystem
Reports Description
Data Formatting
Generating Information
Loading Data
Running a Report
Saving a Report
Viewing Results Saved
Archiving Files
Introduction to Hadoop Terminologies
Tools
Introduction to Amazon Web Services (AWS)
Reports Code

Understanding Data

Data Source: Dice.com is a career website serving information technology and engineering professionals. The Dice database manages the details of all its registered professionals and companies. On the professionals' side, it includes data such as a person's name, qualifications, and soft and hard skills. The companies' data is different: it includes the position/title the company is looking for, when the job was posted, the skills required for the position, the posting location, the pay rate, a description of the position, and so on.

Problem Statement

The primary problem is the format of the data, which is not easily readable and does not present the information from a broader perspective: how the jobs are distributed by location, which distinct positions each company is looking for and how many, how the jobs are spread across the United States, which region of the US has the most job postings, which companies are looking for a specific skill, and so on.

Problem Approach

The basic problem with the data is its format. Once the data is converted into a readable format, the rest of the analysis can be done with programming and query languages, which help us synthesize meaningful information from the raw data. In our approach we first convert the data from JSON format to a tabular format. On the resulting data we then use the Hive or Pig query languages to construct the code and derive the information. We will also see how to create an account on Amazon Web Services and run the queries in the cloud.

Hadoop Ecosystem

The following Hadoop tools are used to derive information from the given data:
HIVE
PIG
Amazon Web Services (AWS)

Reports Description

Report 1: Data Formatting
Our first and most basic step is to convert the raw JSON data into a readable format, i.e., to put the data into tabular form. Hive is well suited to working with tabular formats (ORC file format). Using HQL, we design reports that convert the JSON data and load it into ORC tables in HCatalog.
Report 1a: Uploading Dice.com JSON data (data from dice.com).
Report 1b: Uploading usastates JSON data (all US states and their respective abbreviations).
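To make the JSON-to-tabular idea concrete, the following is a minimal Python sketch that flattens one scraped Dice record into a plain row. The record shape and field values follow the Dice Data sample shown later in this report; the sketch only illustrates the conversion that Report 1a performs in HiveQL.

import json

# One scraped Dice record, shaped like the "Dice Data" sample later in this report
# (every field arrives as a single-element list).
record = '''{"Title": ["Data Scientist - Oxyprime LLC - Florham Park, NJ"],
             "Location": ["Florham Park, NJ"]}'''

doc = json.loads(record)

# Flatten each one-element list into a plain column value -- the same idea
# Report 1a implements with get_json_object/regexp_extract in HiveQL.
row = {field: values[0] for field, values in doc.items()}
print(row["Title"], "|", row["Location"])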

Generating Information

Report 2: We now have the data in a readable format, but we have not yet derived any information from it. Let's write a report to understand how jobs are posted within the USA, i.e., to identify what percentage of jobs is posted in each US state. This task requires the following operations on the data, in Pig:
1. Clean the data to remove records with an empty Posted column.
2. Since we are interested only in location and number of jobs, include only the Dice ID and Location fields in the output, using the FOREACH operator.
3. The Location column does not give the state directly (about 80% of the data is in the form "Logan, UT"), so the field has to be split to give only the state. Use a regular expression to break the field and keep only the text after the comma, with the REGEX_EXTRACT operator.
4. We now have all area names separated from the location, but these include areas from all over the world. Since our aim is only US states, join the output with the usastates table using the JOIN operator. The result is all US states and their respective Dice IDs.
5. With states and Dice IDs in hand, group the Dice IDs by state using the GROUP BY operator.
6. With the Dice IDs grouped, count them using the COUNT operator. This gives the number of jobs posted in each state.
7. However, we want the percentage of jobs posted, not the raw count. Group all records using the GROUP ALL operator and calculate the total number of jobs posted across all US states with the SUM function.
8. From the total obtained above, it is easy to calculate the percentage of jobs posted in each state. Make sure to cast to FLOAT when performing this division (a small Python sketch of this counting-and-percentage logic follows below).
9. Sort the output in descending order with the ORDER BY operator, so the state with the most postings appears first.
10. Display the output using the DUMP operator.
Visualization: Export the output to Tableau and plot it on a map of the United States. [Figure: Tableau map of job-posting percentages by US state.]
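A minimal Python sketch of the grouping, counting, and percentage steps above (the state counts here are dummy values for illustration only; the actual Pig implementation appears in the Reports Code section):

# (state, dice_id) pairs as they would look after the JOIN in step 4 -- dummy values only.
state_jobs = [("California", "d1"), ("California", "d2"), ("Utah", "d3"), ("New York", "d4")]

# Steps 5-6: group by state and count Dice IDs.
counts = {}
for state, dice_id in state_jobs:
    counts[state] = counts.get(state, 0) + 1

# Steps 7-8: total across all states, then percentage per state (note the float division).
total = sum(counts.values())
percentages = {state: (count / float(total)) * 100 for state, count in counts.items()}

# Step 9: sort descending by percentage.
for state, pct in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
    print(state, round(pct, 1))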

Report 3: From the output above it is clear that California has the largest number of jobs posted on Dice.com. However, looking at the postings region by region, the eastern side appears much denser than the west. So let's perform a few more operations and see how the jobs are dispersed. This time we make use of a user-defined function (UDF), written in Python, to split the data into 5 regions.
1. Continuing from the previous script, after generating the total job count in step 7, apply a UDF (usaregions.py) with the FOREACH operator to generate the 5 regions.
2. From the regions above and the total jobs from the previous steps, it is easy to calculate the percentage of jobs posted in each region. Make sure to cast to float when performing this operation.
3. Sort and run the report. The output should display the 5 regions and their respective percentages of jobs posted.
The output shows that jobs are equally spread across the Midwest Region (24%) and the Southeast Region (24%).

Report 4: So far we have looked only at US job postings. Now we are interested in seeing job postings for both US and non-US places. This report makes use of a UDF and the SPLIT operator (a sketch of such a UDF follows this report).
1. This script continues from Report 2 after step 3, where areas are generated from the Location field.
2. To make sure there are no empty area fields (stale data), remove them using the FILTER operator.
3. Group each area with its Dice IDs; this creates no duplicate records.
4. Using the COUNT operator, count the Dice IDs for every area.
5. Use the SPLIT operator to partition the data into US_JobPosts and NonUS_JobPosts. For this we use the UDF (areas.py): when an area name is passed to it, it checks whether the area is in the US. If so, the record goes to US_JobPosts; otherwise it goes to NonUS_JobPosts.
6. Run the query and view the results.
7. The output should show two groups, with all the US states in US_JobPosts and the non-US postings in NonUS_JobPosts.
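Neither areas.py nor usaregions.py is reproduced in this report, so the following is only a guess at their shape: a minimal Jython UDF sketch exposing a checkifexists function, matching the way the Pig scripts later call regionudf.checkifexists(group) (areas.py would be analogous but return the area itself when it is a US state). The state-to-region mapping shown is an illustrative assumption, not the author's actual table.

# usaregions.py (sketch) -- not the author's actual UDF, only an illustration of its likely shape.
# When REGISTERed through Pig's Jython script engine, Pig supplies an outputSchema decorator
# that declares the return type; the fallback below keeps the file runnable as plain Python too.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Illustrative mapping only; the real UDF would cover all 50 states across the 5 regions.
REGIONS = {
    'CA': 'West Region', 'WA': 'West Region',
    'UT': 'Southwest Region', 'TX': 'Southwest Region',
    'IL': 'Midwest Region', 'OH': 'Midwest Region',
    'NY': 'Northeast Region', 'MA': 'Northeast Region',
    'FL': 'Southeast Region', 'GA': 'Southeast Region',
}

@outputSchema('region:chararray')
def checkifexists(area):
    # Map a state abbreviation to its region; anything unknown is treated as non-US,
    # which matches the "group != 'NotUSA'" filter in the Report 4 code.
    if area is None:
        return 'NotUSA'
    return REGIONS.get(area.strip(), 'NotUSA')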

Report 5: This report lets us see the companies and their respective numbers of job postings on Dice.com.
1. The Dice data has no column for the company. However, if we look closely at the Title field, its second part is the company name (about 80% accurate).
2. Using the regular-expression function REGEX_EXTRACT together with the SUBSTRING and LAST_INDEX_OF string functions, we can extract the company name from the Title field.
3. Generate a table with Company Name and Dice ID columns using the FOREACH operator.
4. Group the companies with their Dice IDs using the GROUP BY operator.
5. Count all the Dice IDs and generate a table displaying all the companies and their respective numbers of jobs posted.
This report illustrates the use of basic string operators and, from a business perspective, gives an idea of how many jobs each company posted on Dice (a plain-Python sketch of this Title-field parsing follows after Report 6).

Report 6: The previous report gives only the number of jobs each company posted on Dice. If we are instead interested in all the distinct positions a company offers, and in how many there are, we take a different approach. The following report lists all companies (registered in a UDF), their respective job positions, and the total number of distinct positions.
1. Similar to the company name, the position is extracted from the Title field: it is the first part of the Title. Using the SUBSTRING and INDEXOF functions, the position is extracted from the Title field.
2. Generate a table of all companies and their respective positions.
3. Identify the distinct job positions using the DISTINCT operator (a company may post the same position in two different places).
4. Group the companies with respect to each job posting, using the GROUP BY operator.
5. Count all the distinct positions each company is looking for, using the COUNT operator.
6. Now consider only the set of companies (~1700) registered in the UDF, and run the UDF against the table generated in the previous step using the FOREACH operator.
7. The result contains the registered companies (~1700), their respective positions, and the companies not registered in the UDF; the unregistered companies are filtered out with the FILTER operator.
8. Run the report and display the output.
9. The output displays each company name, its number of distinct job positions, and the list of those positions. It shows that Modis is the company looking for the most distinct positions (636), followed by Kforce Inc. with 596 positions across the globe.
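A minimal Python sketch of the Title-field parsing used in Reports 5 through 7, assuming the "Position - Company - Location ... dice.com" format noted above. The title value is taken from the Dice Data sample later in this report; the actual Pig expressions (SUBSTRING, INDEXOF, REGEX_EXTRACT, LAST_INDEX_OF) appear in the Reports Code section.

# Title format assumed for about 80% of records: "Position - Company - Location ... dice.com"
title = "Data Scientist - Oxyprime LLC - Florham Park, NJ dice.com"

# Position: everything before the first " - " (Report 6, SUBSTRING + INDEXOF).
position = title[:title.index(" - ")]

# Company: the text between the first " - " and the next " - "
# (Report 5, REGEX_EXTRACT + SUBSTRING + LAST_INDEX_OF).
rest = title[title.index(" - ") + 3:]
company = rest[:rest.index(" - ")]

print(position)   # -> Data Scientist
print(company)    # -> Oxyprime LLC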

Report 7: This report is fairly basic and is mainly designed to get acquainted with Amazon Web Services (a detailed description of working with AWS appears in a later section). The report itself looks at each company and its respective skill requirements across all the jobs posted on Dice.
1. Because this runs on AWS, loading the data takes a slightly different approach: a sample of the large data set is gathered and then loaded onto AWS via MobaXterm.
2. We are mainly interested in companies and their skill requirements. Since there is no Company field as such in the database, we use string operations to extract the company name from the Title field, using the SUBSTRING, REGEX_EXTRACT, and LAST_INDEX_OF functions, and generate the records with the FOREACH operator.
3. We now have all the companies and their respective skills for particular job postings, but we want distinct companies and their skills. Hence, group all the companies using the GROUP BY operator.
4. The GROUP BY operator generates a bag of values (skills) along with the key (company name). To make the result easier to read, use the FLATTEN operator to pull the skills out of the bag and display them as normal fields.
5. The output shows each distinct company and its respective skill requirements, for example:
BRiCK House Specialty Resources - HTML 5; CSS; Javascript and JQuery

Loading Data

Loading data into Pig: The HCatLoader interface is used within Pig scripts to read data from HCatalog-managed tables. The first interaction with HCatLoader happens while Pig is parsing the query and generating the logical plan: HCatLoader.getSchema is called, which causes HCatalog to query the Hive metastore for the table schema, and that schema is used for all records in the table. Use the LOAD command to load data into an alias: specify the table name in single quotes and load it using the HCatLoader() interface.
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

Running a Report

Hive reports: If the data to be loaded into HCatalog is partitioned, run the following command:
hive -hiveconf FILEDATE=' ' -f /root/project_mpantangi/orcfile.hql
FILEDATE is the data partition, and /root/project_mpantangi/orcfile.hql is the path of the HQL script file. If there are no partitions, use the following:
hive -f /root/project_mpantangi/report1.hql
Pig reports: Pig does not automatically pick up the HCatalog jars. To bring in the necessary jars, add a flag to the pig command on the command line, as below:
pig -useHCatalog -f report5.pig
The -useHCatalog flag ensures that all the jar files required for executing the report are available.

Saving a Report

Pig reports: Use the STORE function to store output. The HCatStorer() interface can be used if you would like to write data to HCatalog-managed tables.

STORE dicetop20 INTO 'report5_output' USING org.apache.hcatalog.pig.HCatStorer();
Note: if no interface is specified, the output is saved as a text file.
STORE dicetop20 INTO 'report5_output';

Viewing Results Saved

The output files are saved on HDFS. To access them, run the following sequence of commands on the command line:
1. View all the files saved on HDFS:
hdfs dfs -ls
2. Once you see the folder you are interested in, enter the following command:
hdfs dfs -ls reportx_output
3. The folder shows a list of files in it; the output is normally saved in the part-r-00000 file:
hdfs dfs -cat reportx_output/part-r-00000
4. View the contents of the file with the command below and, if required, save it to a file on your local machine:
hdfs dfs -cat report2_output/part-r-00000 > report2_output.csv

Archiving Files

In UNIX, the name of the tar command is short for tape archiving. A common use of tar is simply to combine a few files into a single file for easy storage and distribution. To combine multiple files and/or directories into a single file, use the following command:
tar cvfp mpantangi.tar project_mpantangi
cvfp - creates a tar file (c), verbosely (v), writing to the named file (f), preserving permissions (p).
mpantangi.tar - the name of the .tar file.
project_mpantangi - the directory being archived.
Similarly, to extract files from an archive, use the following command:
tar xvfp mpantangi.tar
xvfp - extracts files from the archive.
mpantangi.tar - the archived file.
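The same archive can also be created and unpacked programmatically; a minimal sketch using Python's standard tarfile module, reusing the file and directory names from the example above:

import tarfile

# Equivalent to: tar cvfp mpantangi.tar project_mpantangi
with tarfile.open("mpantangi.tar", "w") as archive:
    archive.add("project_mpantangi")

# Equivalent to: tar xvfp mpantangi.tar
with tarfile.open("mpantangi.tar", "r") as archive:
    archive.extractall()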

Introduction to Hadoop Terminologies

HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. HDFS is designed to be fault-tolerant through replication and distribution of data. When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data, which are stored across the cluster nodes designated for storage, a.k.a. DataNodes. HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be processing data. The NameNode is responsible for storing and managing the metadata, so that when MapReduce or another execution framework calls for the data, the NameNode tells it where the needed data resides.

MapReduce: MapReduce is a programming model for data processing. Hadoop can run MapReduce programs written in various languages. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. An example of the MapReduce process is sketched below.
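The original figure for this example is not reproduced here; as a stand-in, the following plain-Python sketch walks a small word-count example through the map, sort/shuffle, and reduce phases described above (it only illustrates the model, not Hadoop's actual execution):

from itertools import groupby

lines = ["big data on hadoop", "hadoop stores big data"]

# Map phase: each input line is turned into (word, 1) pairs, independently of the others.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: the framework sorts map output by key so equal keys are adjacent.
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: each key's values are summed to give the word count.
for word, pairs in groupby(mapped, key=lambda pair: pair[0]):
    print(word, sum(count for _, count in pairs))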

HCatalog: HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Pig, MapReduce, and Hive) to read and write data on the grid more easily. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats.

Pig: Pig is a high-level scripting language used with Apache Hadoop. Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin. Pig translates Pig Latin scripts into MapReduce so that they can be executed within Hadoop. Pig Latin is a data-flow language, whereas SQL is a declarative language: SQL is great for asking a question of your data, while Pig Latin allows you to write a data flow that describes how your data will be transformed. Pig Latin can be extended with user-defined functions written in Java, Python, Ruby, or other scripting languages.

HIVE: The tables in Hive are similar to tables in a relational database. Databases are made up of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data. Hive's SQL dialect, called HiveQL, does not support the full SQL-92 specification. Furthermore, Hive has some extensions that are not in SQL-92, inspired by syntax from other database systems, notably MySQL; to a first-order approximation, HiveQL most closely resembles MySQL's SQL dialect. Data analysts use Hive to explore, structure, and analyze data, then turn it into business insight.

ORC File Format: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

JSON File Format: The JavaScript Object Notation (JSON) file format serves a purpose similar to XML. It is a lightweight, "self-describing" data interchange format that is easy to understand and language independent; the text can be read and used as a data format by any programming language.

UDF: As discussed earlier, Pig can be extended with user-defined functions. Pig UDFs can currently be implemented in six languages: Java, Jython, Python, JavaScript, Ruby, and Groovy. A UDF is a function that runs every time a Pig query that references it is executed. Pig also provides support for Piggy Bank, a repository of Java UDFs. Note: before any UDF is used, it must be registered.

Registering a UDF: Register a Jython script as shown below before using the UDF in your Pig scripts. Currently, Pig identifies jython as a keyword and ships the required script engine (Jython) to interpret it.
REGISTER 'companypass.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS companyudf;

Dice Data: Sample Dice data format and the list of fields available for consideration:
{ "all": [],
"Description": ["<div id=\"detaildescription\"><p>the opportunity is with predictive analytics team. The primary responsibility of this position is to provide technical assistance by performing data migration, data analytics, economic modelling, simulation, production analytics and reporting. Additional responsibilities include data cleansing, data analysis, project planning and management, developing custom workflows, and presenting and discussing discoveries. </p></div>"],
"Title": ["Data Scientist - Oxyprime LLC - Florham Park, NJ dice.com "],
"Skills": ["SQL, MS Access and Excel, Statistics, SSIS, SAP Business objects, Basic programming skills (C++, Java, Visual Basic,.net development), Analytical skills\u00a0"],
"Pay_Rate": ["DOE\u00a0"],
"Area_Code": ["973\u00a0"],
"Telecommute": ["no\u00a0"],
"Position_ID": ["902IDS\u00a0"],
"Length": ["12+ months\u00a0"],
"Link": "
"Location": ["Florham Park, NJ"],
"Dice_ID": [" \u00a0"],
"Tax_Term": ["CON_CORP\u00a0"],
"Posted": [" \u00a0"],
"Travel_Req": ["none\u00a0"] }

Tools

Oracle VM VirtualBox: VirtualBox is a cross-platform virtualization application. It installs on your existing Intel- or AMD-based computers, whether they are running Windows, Mac, Linux, or Solaris operating systems, and it extends the capabilities of your existing computer so that it can run multiple operating systems (inside multiple virtual machines) at the same time. For example, you can run Windows and Linux on your Mac, run Windows Server 2008 on your Linux server, run Linux on your Windows PC, and so on, all alongside your existing applications. You can install and run as many virtual machines as you like.

Hortonworks Sandbox: A single-node Hadoop cluster, running in a virtual machine, that implements the Hortonworks Data Platform (HDP). It is packaged as a virtual machine to make evaluation of, and experimentation with, HDP fast and easy.

MobaXterm: MobaXterm is an enhanced terminal for Windows. It brings the essential UNIX commands to the Windows desktop in a single portable exe file that works out of the box.

Tableau: Tableau Desktop is a data analysis and visualization tool.

Amazon Web Services (AWS): AWS is Amazon's cloud computing platform.

Introduction to Amazon Web Services (AWS)

In recent years, virtualization has become a widely accepted way to reduce operating costs and increase the reliability of enterprise IT. In addition, grid computing has made possible a completely new class of analytics, data-crunching, and business-intelligence tasks that were previously cost- and time-prohibitive. Along with these technology changes, the speed of innovation and the unprecedented acceleration in the introduction of new products have fundamentally changed the way markets work. Together with the wide acceptance of software-as-a-service (SaaS) offerings, these changes have paved the way for the latest IT infrastructure challenge: cloud computing.

Amazon Web Services (AWS) is a collection of remote computing services, also called web services, that together make up a cloud computing platform offered by Amazon.com. The main advantage of AWS for a large organization is that investment in hardware and software components can be reduced; AWS is highly scalable, so resources can be ramped up when needed and scaled down when the work fits the available resources. For our queries we need a virtual machine, storage space, and Hadoop; the following three AWS components serve that purpose:
EC2 - Virtual machine - Elastic Compute Cloud
Amazon S3 - Storage space - Simple Storage Service
Amazon EMR - Hadoop on AWS - Elastic MapReduce

Creating an AWS Cluster:
1. Create an account on AWS (to create an account you need to provide valid credit card information).
2. From the AWS services, select EC2 and create a new key pair.
3. Download and save the generated PEM file.
4. From the AWS services, select IAM to create user accounts.
5. Create the user accounts and save the user credentials.

6. From the AWS services, select EMR and create a cluster.
7. While setup is in progress, the cluster status transitions as follows:
STARTING - The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING - Bootstrap actions are being executed on the cluster.
RUNNING - A step for the cluster is currently being run.
WAITING - The cluster is currently active but has no steps to run.
8. In the waiting phase, start MobaXterm and connect to Amazon Web Services.
9. On the SSH connection page, give the following details:
Remote Host: ec us-west-2.compute.amazonaws.com (the Master Public DNS generated during step 7)
Specify Username: hadoop
10. In the Advanced SSH Settings, select Use Private Key and browse to the PEM file generated in step 3.
11. MobaXterm now communicates between you and AWS.
12. Perform all data manipulation operations through MobaXterm.
13. As soon as the work is done, terminate the session on AWS.
14. Go back to the AWS services and select EMR. On the EMR page, click the Terminate button to terminate the cluster. A message is displayed on MobaXterm once the cluster is terminated.
15. The following messages may be displayed on the EMR screen during/after termination:
TERMINATING - The cluster is in the process of shutting down.
TERMINATED - The cluster was shut down without error.
TERMINATED_WITH_ERRORS - The cluster was shut down with errors.
(A scripted alternative to this console workflow is sketched below.)
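The create/monitor/terminate cycle above can also be scripted instead of clicked through in the console. The following is a minimal, illustrative boto3 sketch; it is not part of the original project, and the cluster name, release label, instance types, and key name are assumptions to be replaced with your own values.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Create a small cluster with Hive and Pig installed (assumed parameters; adjust to your account).
response = emr.run_job_flow(
    Name="project-mpantangi",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",          # the key pair created in step 2
        "KeepJobFlowAliveWhenNoSteps": True,  # stay in the WAITING state for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# Poll the status (STARTING -> BOOTSTRAPPING -> RUNNING/WAITING, as described above).
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print(cluster_id, state)

# Terminate when finished (step 14).
emr.terminate_job_flows(JobFlowIds=[cluster_id])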

Reports Code

Report 1: This report imports the raw JSON data into Hive tables using HQL.

Report 1a:
-- hive -hiveconf FILEDATE=' ' -f /root/project_mpantangi/orcfile.hql
-- Read file in JSON format
-- Save file in ORC format
-- Example: FILEDATE=' '

DROP TABLE json_dice;
CREATE TABLE json_dice (jrecord string);
-- default file is a comma-delimited text file

DROP TABLE orc_dice;
CREATE TABLE IF NOT EXISTS orc_dice (
  posted string,
  title string,
  skills string,
  areacode string,
  location string,
  payrate string,
  positionid string,
  diceid string,
  length string,
  taxterm string,
  travelreq string,
  telecommute string,
  link string,
  description string
)
PARTITIONED BY (scrapedate string)
STORED AS ORC;

LOAD DATA LOCAL INPATH "/root/project_mpantangi/dice.${hiveconf:FILEDATE}.json" INTO TABLE json_dice;

ALTER TABLE orc_dice DROP IF EXISTS PARTITION(scrapedate = "${hiveconf:FILEDATE}");

INSERT OVERWRITE TABLE orc_dice PARTITION (scrapedate = "${hiveconf:FILEDATE}")
SELECT
  regexp_extract(get_json_object(json_dice.jrecord, '$.Posted'), '\\["(.*)\\u00a0"\\]', 1) as posted,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Title'), '\\["(.*)"\\]', 1) as title,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Skills'), '\\["(.*)\\u00a0"\\]', 1) as skills,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Area_Code'), '\\["(.*)\\u00a0"\\]', 1) as areacode,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Location'), '\\["(.*)"\\]', 1) as location,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Pay_Rate'), '\\["(.*)\\u00a0"\\]', 1) as payrate,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Position_ID'), '\\["(.*)\\u00a0"\\]', 1) as positionid,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Dice_ID'), '\\["(.*)\\u00a0"\\]', 1) as diceid,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Length'), '\\["(.*)\\u00a0"\\]', 1) as length,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Tax_Term'), '\\["(.*)\\u00a0"\\]', 1) as taxterm,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Travel_Req'), '\\["(.*)\\u00a0"\\]', 1) as travelreq,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Telecommute'), '\\["(.*)\\u00a0"\\]', 1) as telecommute,
  get_json_object(json_dice.jrecord, '$.Link') as link,
  regexp_extract(get_json_object(json_dice.jrecord, '$.Description'), '\\["(.{200}).*"\\]', 1) as description
FROM json_dice;
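A quick way to sanity-check the regexp_extract pattern above outside Hive is to try the same expression in Python's re module on the Position_ID value from the Dice Data sample; for this particular pattern, Java and Python regular expressions behave the same way.

import re

# Field value as it arrives in the scraped JSON: a one-element list whose string ends
# with a non-breaking space (\u00a0) before the closing quote.
raw = '["902IDS\u00a0"]'

# The Hive literal '\\["(.*)\\u00a0"\\]' reaches the regex engine as \["(.*)\u00a0"\].
pattern = r'\["(.*)\u00a0"\]'

print(re.search(pattern, raw).group(1))   # -> 902IDS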

Report 1b:
-- hive -f /root/project_mpantangi/report1.hql
-- Read file in JSON format
-- Save file in ORC format

DROP TABLE json_usa_states;
CREATE TABLE json_usa_states (jrecord string);
-- default file is a comma-delimited text file

DROP TABLE usa_states;
CREATE TABLE usa_states (
  States string,
  Abbreviation string
)
STORED AS ORC;

LOAD DATA LOCAL INPATH "/root/project_mpantangi/usa_states.json" INTO TABLE json_usa_states;

INSERT OVERWRITE TABLE usa_states
SELECT
  get_json_object(json_usa_states.jrecord, '$.States') as States,
  get_json_object(json_usa_states.jrecord, '$.Abbreviation') as Abbreviation
FROM json_usa_states;
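usa_states.json itself is not reproduced in this report; judging from the get_json_object calls above, each record presumably has the following shape (an assumption, shown here with a quick Python check):

import json

# Assumed shape of one usa_states.json record, inferred from '$.States' and '$.Abbreviation' above.
record = '{"States": "Utah", "Abbreviation": "UT"}'
row = json.loads(record)
print(row["States"], row["Abbreviation"])   # -> Utah UT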

Report 2: This report focuses on filtering US states from the Dice data and reporting each state's percentage of job postings relative to the US overall.
-- pig -useHCatalog -f report2.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Load United States of America states data
usastates = LOAD 'default.usa_states' USING org.apache.hcatalog.pig.HCatLoader();

-- FILTER out the first line of the file to avoid errors with Pig 0.12
dice_filtered = FILTER dice BY ((posted IS NOT NULL) AND (posted != ''));

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice_filtered GENERATE location, diceid;

-- From the Location field, extract state names and make sure no stale data (special characters, single characters, etc.) is kept.
-- The regular expression keeps the part of the location field after the comma (e.g. India from [Mumbai, India], or UT from [Logan, UT]).
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Group areas with job postings
area_jobs = GROUP dice_area BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(dice_area.diceid) AS jobscount;

-- Filter out jobs posted outside the United States of America by joining on the states table
usa_jobposts = JOIN jobs_count BY group, usastates BY abbreviation;

-- Keep state name and jobs count
state_jobs = FOREACH usa_jobposts GENERATE states, jobscount;

-- Group all
sum_jobs = GROUP state_jobs ALL;

-- Generate the total number of US postings
total_jobs = FOREACH sum_jobs GENERATE SUM(state_jobs.jobscount) AS totaljobs;

-- Percentage per state
jobs_percent = FOREACH state_jobs GENERATE states, (jobscount/(float)total_jobs.totaljobs)*100 AS jobpercent;

-- Sort so that the states with the most postings come first
job_topstates = ORDER jobs_percent BY jobpercent DESC;

-- Display top 5 states
--top5 = LIMIT job_topstates 5;
--DUMP top5;

-- Display all the states in the US and their respective number of postings
--DUMP job_topstates;

-- Save the output to "Report2_output"
STORE job_topstates INTO 'Report2_output';

Report 3: This report divides the data set into US and non-US job postings and displays the number of jobs posted in each area.
-- pig -useHCatalog -f report3.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice GENERATE location, diceid;

-- From the Location field, extract state/area names and make sure no stale data (special characters, single characters, etc.) is kept.
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Filter records with no location/area
useable_data = FILTER dice_area BY (area != '') AND (area IS NOT NULL);

-- Group areas with job postings
area_jobs = GROUP useable_data BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(useable_data.diceid) AS jobscount;

-- Register the UDF
REGISTER 'areas.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS areaudf;

-- Segregate the job postings into US and non-US
SPLIT jobs_count INTO US_JobPosts IF group == areaudf.checkifexists(group), NonUS_jobPosts IF group != areaudf.checkifexists(group);

-- Display the total number of US postings and non-US postings
--DUMP US_JobPosts;
--DUMP NonUS_jobPosts;

-- Save the outputs into 'report3a_output' and 'report3b_output'
STORE US_JobPosts INTO 'report3a_output';
STORE NonUS_jobPosts INTO 'report3b_output';

Report 4: This report shows the percentage of jobs spread across each of the 5 regions of the US.
-- pig -useHCatalog -f report4.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Generate location and dice id columns from the whole data
dice_location = FOREACH dice GENERATE location, diceid;

-- From the Location field, extract state names and make sure no stale data (special characters, single characters, etc.) is kept.
dice_area = FOREACH dice_location GENERATE REGEX_EXTRACT(location, '((?<=, ).*?([^\n]+))', 1) AS area, diceid;

-- Group areas with job postings
area_jobs = GROUP dice_area BY area;

-- Count how many jobs have been posted in each area
jobs_count = FOREACH area_jobs GENERATE group, COUNT(dice_area.diceid) AS jobscount;

-- Group all areas
groupall = GROUP jobs_count ALL;

-- Calculate the total jobs count
totaljobs = FOREACH groupall GENERATE SUM(jobs_count.jobscount) AS total;

-- Register the UDF
REGISTER 'usaregions.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS regionudf;

-- Map each area to a region
usa_regions = FOREACH jobs_count GENERATE regionudf.checkifexists(group) AS region, jobscount;

-- Group by region
group_regions = GROUP usa_regions BY region;

-- Percentage of jobs in each region
region_jobs = FOREACH group_regions GENERATE group, (SUM(usa_regions.jobscount)/(float) totaljobs.total)*100 AS percent;

-- Filter out non-US records
filter_regions = FILTER region_jobs BY group != 'NotUSA';

-- Print output
--DUMP filter_regions;

-- Save output into report4_output
STORE filter_regions INTO 'report4_output';

Report 5: This report enables the user to view the list of all companies and their respective numbers of job postings.
-- pig -useHCatalog -f report5.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- From the Title field, extract company names.
-- This expression works only when the Title field is written in the format "Position - Company - Location Dice posting".
company_dice = FOREACH dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, diceid;

-- Group all the data by company.
company_group = GROUP company_dice BY company;

-- Count how many jobs each company has posted
jobs_count = FOREACH company_group GENERATE group, COUNT(company_dice.diceid) AS jobscount;

-- Sort so the companies with the most postings come first
orderjobs = ORDER jobs_count BY jobscount DESC;

-- Display the top 20 companies
top20 = LIMIT orderjobs 20;
-- DUMP top20;

-- Save the output to "report5_output"
STORE orderjobs INTO 'report5_output';

Report 6: This report lists each company, its distinct job positions, and its total number of distinct job positions.
-- pig -useHCatalog -f report6.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- Extract the position (the first part of the Title field) along with the title
position_dice = FOREACH dice GENERATE SUBSTRING(title, 0, (INDEXOF(title, '-', 1)-1)) AS position, title;

-- Extract the company (from the Title field) along with the position
company_dice = FOREACH position_dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, position;

-- Keep only distinct (company, position) pairs.
distinct_jobs = DISTINCT company_dice;

-- Register the UDF
REGISTER 'companypass.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS companyudf;

-- Verify whether each company exists in the UDF; companies not listed in the UDF come back as "NoCompanyDetected".
valid_company = FOREACH distinct_jobs GENERATE companyudf.checkifexists(company) AS company, position;

-- Group the positions each company is looking for.
group_locations = GROUP valid_company BY company;

-- Display the group results in tabular format
company = FOREACH group_locations GENERATE group, COUNT(valid_company.position) AS count, valid_company.position;

-- Ignore all the records with company name "NoCompanyDetected"
company_position = FILTER company BY (group != 'NoCompanyDetected') AND (group != '');

-- Sort so the company with the most positions is on top
orderpositions = ORDER company_position BY count DESC;

-- Display the top 20 records
display20 = LIMIT orderpositions 20;

-- Display output
-- DUMP display20;

-- Save the output into 'report6_output'
STORE display20 INTO 'report6_output';

Report 7: This report lists all the distinct companies and their respective skill requirements for the jobs posted on dice.com.
-- pig -useHCatalog -f report7.pig

-- Load Dice data
dice = LOAD 'default.orc_dice' USING org.apache.hcatalog.pig.HCatLoader();

-- List company and skills
company_dice = FOREACH dice GENERATE SUBSTRING(REGEX_EXTRACT(title, '(- (.*?) -)', 1), 2, (LAST_INDEX_OF(REGEX_EXTRACT(title, '((- (.*?) -))', 1), '-')-1)) AS company, skills;

-- Group by company.
company_group = GROUP company_dice BY company;

-- Flatten the bag of skills
company_skills = FOREACH company_group GENERATE group, FLATTEN(company_dice.skills) AS fskills;

--DUMP company_skills;

-- Save the report into report7_output.
STORE company_skills INTO 'report7_output';



Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved. Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

AWS Schema Conversion Tool. User Guide Version 1.0

AWS Schema Conversion Tool. User Guide Version 1.0 AWS Schema Conversion Tool User Guide AWS Schema Conversion Tool: User Guide Copyright 2016 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. Amazon's trademarks and trade dress may

More information

Sisense. Product Highlights. www.sisense.com

Sisense. Product Highlights. www.sisense.com Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

File S1: Supplementary Information of CloudDOE

File S1: Supplementary Information of CloudDOE File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

Apache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com

Apache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Apache Sentry Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture

More information

Using EMC Documentum with Adobe LiveCycle ES

Using EMC Documentum with Adobe LiveCycle ES Technical Guide Using EMC Documentum with Adobe LiveCycle ES Table of contents 1 Deployment 3 Managing LiveCycle ES development assets in Documentum 5 Developing LiveCycle applications with contents in

More information

Evaluation Checklist Data Warehouse Automation

Evaluation Checklist Data Warehouse Automation Evaluation Checklist Data Warehouse Automation March 2016 General Principles Requirement Question Ajilius Response Primary Deliverable Is the primary deliverable of the project a data warehouse, or is

More information

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer Automated Data Ingestion Bernhard Disselhoff Enterprise Sales Engineer Agenda Pentaho Overview Templated dynamic ETL workflows Pentaho Data Integration (PDI) Use Cases Pentaho Overview Overview What we

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Introduction to Apache Pig Indexing and Search

Introduction to Apache Pig Indexing and Search Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational

More information

User Guide. Analytics Desktop Document Number: 09619414

User Guide. Analytics Desktop Document Number: 09619414 User Guide Analytics Desktop Document Number: 09619414 CONTENTS Guide Overview Description of this guide... ix What s new in this guide...x 1. Getting Started with Analytics Desktop Introduction... 1

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black

Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black Forensic Clusters: Advanced Processing with Open Source Software Jon Stewart Geoff Black Who We Are Mac Lightbox Guidance alum Mr. EnScript C++ & Java Developer Fortune 100 Financial NCIS (DDK/ManTech)

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

The Inside Scoop on Hadoop

The Inside Scoop on Hadoop The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop

More information

Course Scheduling Support System

Course Scheduling Support System Course Scheduling Support System Roy Levow, Jawad Khan, and Sam Hsu Department of Computer Science and Engineering, Florida Atlantic University Boca Raton, FL 33431 {levow, jkhan, samh}@fau.edu Abstract

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Data Warehouse Center Administration Guide

Data Warehouse Center Administration Guide IBM DB2 Universal Database Data Warehouse Center Administration Guide Version 8 SC27-1123-00 IBM DB2 Universal Database Data Warehouse Center Administration Guide Version 8 SC27-1123-00 Before using this

More information

Getting Started with Amazon EC2 Management in Eclipse

Getting Started with Amazon EC2 Management in Eclipse Getting Started with Amazon EC2 Management in Eclipse Table of Contents Introduction... 4 Installation... 4 Prerequisites... 4 Installing the AWS Toolkit for Eclipse... 4 Retrieving your AWS Credentials...

More information

Getting Started with Hadoop with Amazon s Elastic MapReduce

Getting Started with Hadoop with Amazon s Elastic MapReduce Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson

More information

Data Semantics Aware Cloud for High Performance Analytics

Data Semantics Aware Cloud for High Performance Analytics Data Semantics Aware Cloud for High Performance Analytics Microsoft Future Cloud Workshop 2011 June 2nd 2011, Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS) Acknowledgement

More information