Introduction to Hadoop on the cloud using BigInsights on BlueMix dev@pulse, Feb. 24-25, 2014

Hands on Lab
Introduction to Hadoop on the cloud using BigInsights on BlueMix
dev@pulse, Feb. 24-25, 2014

Cindy Saracco, Senior Solutions Architect, saracco@us.ibm.com, @IBMbigdata
Nicolas Morales, Solutions Engineer, nicolasm@us.ibm.com, @NicolasJMorales

Table of Contents

Getting started
   Pre-requisites
   What you'll learn
Exercise 1: Exploring the Web Console
   Launching the Web Console
   Working with the Welcome page
   Inspecting the status of your cluster
   Working with Files
Exercise 2: Analyzing data with BigSheets
   Collecting social media data
   Creating a BigSheets workbook
   Tailoring workbooks and generating charts
Exercise 3: Querying data with IBM Big SQL
   Obtaining sample data
   Creating, loading, and querying a Big SQL table
Optional Exercise: Setting up an Eclipse environment
   Getting started
   Creating a BigInsights Server Connection
   Creating a Big SQL Connection in Eclipse
   Creating and testing a Big SQL JDBC client application
Summary
Important Information

Getting started

In this hands-on lab, you'll learn how to work with IBM's Platform-as-a-Service (PaaS) for MapReduce, a cloud offering now in beta. This offering enables Hadoop developers to quickly get started using critical services from InfoSphere BigInsights (IBM's Hadoop-based platform) to create big data applications. By using this beta service, developers can avoid the overhead of acquiring and provisioning their own hardware cluster. Moreover, a cloud-based infrastructure enables them to rapidly scale their hardware environment as their application needs increase.

In this lab, you'll learn how to use IBM's Hadoop-based cloud services to explore social media data. In particular, you'll investigate global coverage of a popular brand ("IBM Watson") through the use of a spreadsheet-style interface. Later, you'll learn how you can query this social media data using Big SQL, IBM's SQL interface to data managed by BigInsights.

As an aside, if you prefer to work with BigInsights on your own hardware, you can download a free Quick Start Edition installation image or VMware image. Just visit this web site.

Pre-requisites

This lab uses beta software available on IBM's BlueMix cloud environment. Prior to starting this lab, you need to obtain a BlueMix account. Registration is free, although the number of available seats is limited. To apply for an account, visit http://ng.bluemix.net and click the Join us in beta button.

Once you have an account, you should become familiar with the BlueMix environment before starting this lab. In particular, you should be able to log into your account, create an application, and bind that application to IBM's MapReduce service, which is the subject of this lab. (The process of binding your application to this service is the same as the process of binding your application to other services available on BlueMix.) If necessary, consult the online BlueMix documentation or enroll in a separate lab to learn more about the BlueMix environment.

Finally, you need to locate the settings for your MapReduce environment variables, as these include the appropriate URLs for accessing various BigInsights services as well as the required user ID and password.

What you'll learn

After completing this hands-on lab, you'll be able to:
- Launch the Web console and access several of its key services
- Explore big data using a spreadsheet-style tool
- Query big data using Big SQL
- Configure Eclipse to use the BigInsights plug-in, which includes Big SQL support

Allow 1 to 2 hours to complete the core sections of this lab.

NOTE: Images of screen captures contain sample data. Your output may vary, depending on your environment. The BlueMix environment is expected to evolve throughout its beta program and user interfaces are subject to change. In addition, some code examples in this lab include sample user IDs, passwords, and service URLs. You will need to modify the examples to include appropriate data for your environment.

To learn more about IBM's Hadoop-based platform and its MapReduce service, visit the BigInsights technical wiki or participate in the forum. To learn more about BlueMix and participate in its community, visit the BlueMix Dev site.

Exercise 1: Exploring the Web Console

IBM's Hadoop-based services enable firms to store, process, and analyze large volumes of various types of data. Included in these services is access to a Web console for inspecting the health of your cluster, monitoring the status of jobs (applications), downloading certain application development aids, and performing other functions. Before developing an application that uses IBM's MapReduce service, it will be helpful for you to become familiar with the Web console. For further details on the Web console or BigInsights, consult the product documentation.

After completing this hands-on lab, you'll be able to:
- Launch the Web console.
- Work with popular resources accessible through the Welcome page.
- Inspect the status of your cluster.
- Work with the distributed file system. In particular, you'll explore the Hadoop Distributed File System (HDFS) directory structure, create subdirectories, and upload a file to HDFS.

Allow 15-30 minutes to complete this section of the lab.

This lab is an introduction to a subset of console functions. Administrative capabilities available through the Web console in production environments won't be covered here, because your BlueMix beta account currently lacks administrative authority. In addition, real-time monitoring dashboards and application linking are among the more advanced console functions that are out of this lab's scope.

Launching the Web Console

In this section, you'll learn how to launch the Web console for the IBM MapReduce (BigInsights) service.

1. If necessary, create a new application using an appropriate BlueMix boilerplate template available in the catalog and add the IBM MapReduce service to it. Subsequent sections of this exercise use the Java+DB Web Starter application boilerplate as an example.

2. Verify that your cloud application that uses IBM's MapReduce service is up and running. (The BlueMix dashboard displays the status of your applications.) For example, the image below depicts a JavaDBSample application that includes IBM's MapReduce service (shown circled). Note that the application's status shows that it is running.

3. Optionally, double-click on your application's icon (not the displayed URL) to display further details about it. Using the example above as a guideline, click on the orange box at top to see additional information about the services available.

4. Locate the VCAP_Services environment variables associated with your application to determine the MapReduce service's Web console URL, user ID, and password. The way in which you access this information will depend on the application boilerplate you selected from the BlueMix dashboard. Often, the first category beneath your application's OVERVIEW button will contain the appropriate information. In this example, there is a RUNTIME button that displays this information if you scroll down to the bottom of the page. Environment variable settings are in JSON format. The sample below includes the information you need to collect for this exercise (in particular, the username, password, and ConsoleUrl values used in the next steps).

   {
     "name": "MapReduce-ei562",
     "label": "MapReduce-2.1.0",
     "plan": "Community",
     "credentials": {
       "username": "u123456",
       "password": "pw123456.1234567890",
       "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/dbu123456",
       "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
       "HiveUrl": "jdbc:hive://11.22.33.44:10000/dbu123456",
       "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
     }
   }

5. Copy and paste the ConsoleUrl value into your Web browser.

6. When prompted, enter the username and password values to log into the console.

7. Verify that your Web console appears similar to this:

Working with the Welcome page

This section introduces you to the Web console's main page displayed through the Welcome tab. The Welcome page features links to common tasks, many of which can also be launched from other areas of the console. In addition, the Welcome page includes links to popular external resources, such as the BigInsights Information Center (product documentation) and community forum.

1. Inspect the Quick Links pane at top right and use its vertical scroll bar (if necessary) to become familiar with the various resources accessible through this pane. Note that this section contains links for downloading software drivers and an Eclipse plug-in.

2. Inspect the Learn More pane at lower right. Links in this area access external Web resources that you may find useful, such as the BigInsights Information Center, a public discussion forum, IBM support, and IBM's BigInsights product site. If desired, click on one or more of these links to see what's available.

Inspecting the status of your cluster

The Web console allows administrators to inspect the overall health of their cluster as well as perform basic functions, such as starting and stopping specific servers or components, adding nodes to the cluster, and so on. The free BlueMix beta offering precludes you from obtaining administrative status at this time, but you can still explore some basic capabilities that don't require administrative authority.

1. Click on the Cluster Status tab at the top of the page.

2. Inspect the overall status of your cluster. The figure below was taken on a cluster of 7 nodes that had several services running. (Host node information about each node was masked in this graphic; your display will show IP addresses of each node in your cluster.) Note that on this cluster, HBase, Monitoring, and Oozie services were unavailable.

3. Click on the Hive service and note the detailed information provided for this service in the pane at right. (Host node information was masked in this graphic; your display will show node IP addresses for the Hive Node, Hive Web Interface, and JDBC URL.) By clicking on any installed service, administrators can start or stop the service.
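Step 4 of "Launching the Web Console" asks you to read the console URL, user ID, and password out of the JSON environment settings by hand. As a minimal sketch, the same lookup can be scripted. The sample entry below copies the lab's placeholder values; the function name console_login_details is hypothetical, and the sketch assumes you have already copied the single service entry (not the full environment variable) into a string.

```python
import json

# Sample entry copied from the lab's example; real values come from your own
# application's VCAP_Services settings.
SAMPLE_SERVICE_ENTRY = """
{
  "name": "MapReduce-ei562",
  "label": "MapReduce-2.1.0",
  "plan": "Community",
  "credentials": {
    "username": "u123456",
    "password": "pw123456.1234567890",
    "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/dbu123456",
    "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
    "HiveUrl": "jdbc:hive://11.22.33.44:10000/dbu123456",
    "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
  }
}
"""


def console_login_details(service_entry_json):
    """Return the console URL, user ID, and password from one service entry."""
    entry = json.loads(service_entry_json)
    credentials = entry["credentials"]
    return {
        "console_url": credentials["ConsoleUrl"],
        "username": credentials["username"],
        "password": credentials["password"],
    }


if __name__ == "__main__":
    details = console_login_details(SAMPLE_SERVICE_ENTRY)
    # Print only the non-secret values needed to open the console.
    print("Console URL:", details["console_url"])
    print("User ID:", details["username"])
```

The other URLs in the credentials block (BigSqlUrl, HiveUrl, HttpfsUrl) can be read the same way when later exercises need them.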

Working with Files

The Files tab of the console enables you to explore the contents of your file system, create new subdirectories, upload small files for test purposes, and perform other file-related functions. In this module, you'll learn how to perform such tasks against the Hadoop Distributed File System (HDFS) of BigInsights.

1. Click on the Files tab of the Web console to begin exploring your distributed file system.

2. Expand the directory tree shown in the pane at left to locate the /user subdirectory for your user ID (/user/biadmin is shown below).

3.

4. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply point your cursor at an icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page.

5. Position your cursor on your user subdirectory (/user/biadmin in this example) and click the Create Directory icon to create a subdirectory for test purposes.

6. When a pop-up window appears prompting you for a directory name, enter ConsoleLab and click OK.

7. Expand the directory hierarchy to verify that your new subdirectory was created.

8. Create another directory named ConsoleLabTest.

9. Use the Rename icon to rename this directory to ConsoleLabTest2.

10. Click the Move icon. When the pop-up Move screen appears, select the ConsoleLab directory and click OK.

11. Using the Set Permissions icon, you can change the permission settings for your directory. When finished, click OK.

12. While highlighting the ConsoleLabTest2 folder, select the Remove icon and remove the directory.

13. Obtain the sample blogs-data.txt file from your instructor, or download the sampledata.zip file from this article and extract the .zip file to a directory on your local file system. In a moment, you will upload the blogs-data.txt file to the cloud DFS.

14. In the ConsoleLab directory of your cloud DFS, click the Upload icon to upload a small sample file for test purposes.

15. When the pop-up window appears, click the Browse button to browse your local file system for the sample file you obtained earlier (blogs-data.txt).

16. Navigate through your local file system to the directory and locate the blogs-data.txt file. Click OK.

17. Verify that the window displays the name of this file. Note that you can continue to Browse for additional files to upload and that you can delete files as upload targets from the displayed list. However, for this exercise, simply click OK.

18. When the upload completes, verify that the file appears in the directory tree at left. If it is not immediately visible, click the Refresh button. On the right, you should see a subset of the file's contents displayed in text format.

19. Highlight the blogs-data.txt file in your ConsoleLab directory and click the Download button.

20. When prompted, click the Save File button. Then select OK.
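The steps above use the console's Files tab, but the sample credentials shown earlier also list an HttpfsUrl ending in /webhdfs/v1/, which is the path prefix of Hadoop's standard WebHDFS REST API. The following sketch only builds request URLs for a few of the same file operations; it sends nothing over the network. The helper name webhdfs_url is hypothetical, and whether the beta service accepts such requests, and how it authenticates them, is an assumption this lab does not confirm.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base_url, hdfs_path, operation, **params):
    """Build a WebHDFS REST URL such as .../webhdfs/v1/user/biadmin?op=LISTSTATUS.

    base_url is the HttpfsUrl value from your credentials; hdfs_path must be an
    absolute HDFS path starting with "/".
    """
    query = {"op": operation}
    query.update(params)
    return base_url.rstrip("/") + quote(hdfs_path) + "?" + urlencode(query)


if __name__ == "__main__":
    base = "http://11.22.33.44:14000/webhdfs/v1/"  # sample value from the lab
    lab_dir = "/user/biadmin/ConsoleLab"

    # HTTP PUT: create the ConsoleLab directory (console step: Create Directory).
    print(webhdfs_url(base, lab_dir, "MKDIRS"))
    # HTTP GET: list the directory's contents (console step: expand the tree).
    print(webhdfs_url(base, lab_dir, "LISTSTATUS"))
    # HTTP PUT: upload blogs-data.txt, replacing any existing copy.
    print(webhdfs_url(base, lab_dir + "/blogs-data.txt", "CREATE", overwrite="true"))
    # HTTP GET: download the file again.
    print(webhdfs_url(base, lab_dir + "/blogs-data.txt", "OPEN"))
```

MKDIRS, LISTSTATUS, CREATE, and OPEN are standard WebHDFS operation names; the HTTP method noted in each comment is the one the WebHDFS specification pairs with that operation.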


Exercise 2: Analyzing data with BigSheets

IBM's Hadoop-based offering enables firms to store, process, and analyze large volumes of various types of data. In this exercise, you'll see how you can explore social media data collected from a sample application provided with InfoSphere BigInsights using BigSheets, a spreadsheet-style tool accessible from the Web console.

This lab exercise is based on an article that can be found here: http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html. It's a good idea to read this article before attempting the lab, as the article explains the business context and application scenario covered in this exercise. If you prefer, you can watch a 14-minute video from the author of the article here: http://www.youtube.com/watch?v=kny3npwsz_w

Before completing this exercise, you should be familiar with the Web console and be able to perform the basic operations covered in the previous lab module.

After completing this hands-on lab, you'll be able to:
- Create BigSheets workbooks based on social media data collected about a popular brand ("IBM Watson" in this scenario).
- Perform simple data cleansing and analytical operations to discover insights about the social media data.
- Tag your workbooks so you can easily locate those of interest later.
- Create charts based on your analysis.

Allow 45-60 minutes to complete this lab.

Collecting social media data

Sample social media data about "IBM Watson" is available for public download from the BigSheets developerWorks article referenced above. (In production environments, analysts can run an IBM-provided application to collect social media data about search items of their choice.)

Where did this data come from? Data for this lab was collected using IBM's sample BoardReader application provided with InfoSphere BigInsights. This application collected data from thousands of news and blog web sites, and a subset of this information was provided as an attachment to a developerWorks article.

For purposes of this lab, examples cite a user ID of "biadmin" for your BigInsights / MapReduce service. Substitute your user ID for biadmin as you work through the exercise.

1. If necessary, launch the Web console.

2. Download and unzip the sampledata.zip file provided with this article into a local directory on your computer.

3. Using the DFS navigator available from the Files tab of the Web console, create subdirectories for sampledata and sampledata/ibmwatson under your user ID's directory. For example, if your user ID is biadmin, create a /user/biadmin/sampledata/ibmwatson directory.

4. Upload the news-data.txt and blogs-data.txt files to the ../sampledata/ibmwatson directory.

5. To review this data, use the Files tab to navigate to the following folder (/user/biadmin/sampledata/ibmwatson) and select the blogs-data.txt file as shown below.

In a later section, you will convert this file to a BigSheets workbook so you can explore, customize, and visualize the data.

Creating a BigSheets workbook

In this section, you will use a spreadsheet-style interface (BigSheets) to explore the social media data you just uploaded. BigSheets provides access to data in structures known as workbooks.

1. Return to the Files tab.

2. Navigate to the /user/biadmin/sampledata/ibmwatson/blogs-data.txt file and click on the file.

3. Click the Sheet radio button to view this data within a BigSheets interface.

4. The data is formatted in a JSON Array structure. Click the pencil icon and select the JSON Array option for this file. Then click the green check mark.

5. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click the Save button.

6. Repeat this process for the news-data.txt file in the same folder. To do this, return to the Files tab, navigate to the file, and follow the 3 previous steps. This time, name the workbook Watson News.

7. Click on the Workbooks link in the upper left-hand corner of the page.

8. Verify that you see these two workbooks on your system.

9. Add tags to your workbook so users can easily search for and locate it among a long list of workbooks. To do so, first select the Watson Blogs workbook.

10. Scroll down to Workbook Details and add tags for Watson, IBM, and Blogs by selecting the green + and adding each individually. If you don't see Workbook Details, you may need to toggle between Normal and Full Screen.

11. From the BigSheets tab, you can quickly filter workbooks and search for a specific tag. Enter the term tag: Blogs to see all workbooks that have the associated tag.
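Step 4 in this section tells BigSheets to read the sample file as a JSON Array, meaning one array whose elements are records that become the workbook's rows. As a rough sketch of what that reader choice implies, the snippet below parses a tiny array and derives the column names. The two records and their values are made up for illustration; only the field names (Country, Language, Type, IsAdult) come from columns mentioned in this lab, and column_names is a hypothetical helper, not part of BigSheets.

```python
import json

# Two made-up records in JSON Array form; the real blogs-data.txt file has
# many more records and more fields per record.
SAMPLE_JSON_ARRAY = """[
  {"Country": "US", "Language": "English", "Type": "blog", "IsAdult": false},
  {"Country": "RU", "Language": "Russian", "Type": "blog", "IsAdult": false}
]"""


def column_names(json_array_text):
    """Return the distinct field names across all records, in first-seen order."""
    records = json.loads(json_array_text)
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    return names


if __name__ == "__main__":
    print(column_names(SAMPLE_JSON_ARRAY))
```

Running a check like this against a local copy of the sample file is a quick way to preview which columns to expect before building a workbook.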

Tailoring workbooks and generating charts

In this section, you'll learn how to customize your workbook in a few simple ways. For example, you'll learn how to remove unwanted columns for a given workbook, combine data from multiple workbooks together, and perform simple data cleansing operations. You'll even see how you can visualize your results in simple charts.

1. From the list of workbooks displayed in BigSheets (which you launched in previous steps), click on the link named Watson News to open this workbook.

2. This Master Workbook is a base workbook and has a limited set of things you can edit. Therefore, in order to begin to manipulate the data contained within a workbook, we will want to create a dependent workbook.

   a. Click the Build new Workbook button.

   b. When the new Workbook appears, you can change its default name (by clicking on the pencil icon next to the name) to the new name of Watson News Revised. Then click the green check mark.

   c. Click the Fit column(s) button to more easily see columns A through H on your screen.

3. Remove the column IsAdult from your workbook. This is currently column E. Click on the triangle next to the column name of IsAdult and select the Remove option to remove this from your new workbook.

Did I lose data? Deleting a column does not remove data. Deleting a column in a workbook just removes the mapping to this column.

4. In this case, you want to keep only a few columns. To more easily remove a larger number of columns (without having to repeat this same click-remove process), click the triangle again (from any column) and select the Organize Columns option.

   a. Click the red X button next to each column title you want to remove. In this case, KEEP the following columns:
      i. Country
      ii. FeedInfo
      iii. Language
      iv. Published
      v. SubjectHtml
      vi. Tags
      vii. Type
      viii. Url

   b. Click the green check mark button when you are ready to remove the columns you have selected.

5. Click on the Fit column(s) button again to show columns A through H. You should see the following columns in your new workbook.

6. Select Save and Exit. You may input an optional description. Click Save to complete the save.

7. After clicking Save, you will be shown two buttons (Run and Close). Click the Run button to run the workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

8. To remove the unwanted columns from the Watson Blogs workbook, perform the same steps above so that you wind up with a new workbook called Watson Blogs Revised.

9. Now, since we have two workbooks with the exact same structure, we can perform a union of these two workbooks as the basis for exploring the coverage of IBM Watson across the sources that BoardReader provided.

10. To perform this action, make sure you are currently in the Watson News Revised workbook. Click the Build New Workbook button again.

11. In the top left-hand side or bottom left, you should see a link called Add sheets. This allows you to perform additional analysis on your data within the current workbook. Click the Add sheets link.

12. The Load option will allow you to load data into the current workbook from another workbook. Click the Load icon and select the Watson Blogs Revised workbook link.

13. The system will ask you for a *Sheet Name and you should change Sheet1 to Watson Blogs Revised as the name of the new tab that will be created in your current workbook.

14. Click the green check-mark button at this time to load the new workbook into your current workbook.

15. Verify that you see two tabs at the bottom of your current workbook. Move your mouse over the second one, and a tool tip will show the action and the name you provided for this sheet / tab within your current workbook. (Giving your tabs meaningful names gives you and others who use your sheets an easy way to understand your data processing flow(s).)

16. Next, add a new sheet to perform the Union function. Select Union.

17. The Union function asks for the other sheet you would like to use. Select the triangle to expose the pull-down menu.

18. Select Watson News Revised and then click the green plus-mark button.

19. Provide the sheet name News and Blogs. Before you click the green check-mark button to add this new tab/sheet to your workbook, make sure your options match the example below.

20. Click the green check-mark button to add this tab to your workbook.

21. Save and Exit and then run this new workbook. When prompted for a description, you can change the name of your new workbook from Watson News Revised(1) to Watson News and Blogs. Click the Save button. Then click the Run button to run the workbook.

22. Select the Workflow Diagram icon to see a mapping of the workbooks associated with the News and Blogs workbook. This can be done at any point to keep a clear picture of which workbooks you are extending to/from.

23. Close this frame.

24. If you are not already in the workbook, open the Watson News and Blogs workbook.

25. In this case, we want to keep our initial workbook as is and produce another workbook that contains the records in sorted order. So, click the Build New Workbook button to do this.

26. To more easily keep track of what you are doing, rename your new workbook immediately. (This is a good practice to follow in general.) Call this workbook Watson Sorted.

27. Explore the language and the types of posts contained in your workbook. To do so, click the triangle next to the column name of any column so that you can select the Sort -> Advanced option.

28. Click on the pull-down triangle to expose the list of columns under the Add Columns to Sort area. Click on the green + button to add the two columns you wish to sort on. Then, select the desired order for sorting each column. In this case, your Advanced Sort should look like the following picture.

29. Click on the green check-mark button to continue and create the new tab/sheet with your desired sorting applied to it.

30. As with all new tabs/sheets, the system shows you a simulated result based on the rows of data BigSheets keeps in memory. You should be able to click on Fit column(s) to review the contents of both the Language and Type columns to see that your advanced sort was applied to this simulated set of data.

31. Now, Save and Exit and then run your workbook. This applies the sorting options to the entire, larger data set in the workbook rather than only the first 2,000 rows the system operates on as a simulation. So, you should see different results once your workbook has been run. For example, in the simulated data, only one Vietnamese row was showing. However, against the entire data set, you should see twenty (20) rows that are of the Vietnamese language. This is because more of the Vietnamese rows were in the data beyond the first 2,000 rows the system uses in memory for a simulated result before you click the Run button. Review and confirm these results after the job reaches 100%, and then you can move on to the next step.

32. To easily visualize the coverage of posts about IBM Watson by language, you can create a chart. While still in the Watson Sorted workbook, click the Add chart link in the lower left. When the list of available chart types is displayed, click Chart > Pie. Then complete the following information to produce a pie chart of the languages used.

33. Click on the green check-mark button to create the chart tab.

34. Just like working with tabular data, you will see a simulated visualization. Again, this is based on the rows in cache. (If you click on the Close button here, you can interact with the chart, which is based on simulated data. You would then click the Run button in the upper right.)

35. Click the Run button to run the visualization against the entire data set.

36. Once the chart has been run, you can interact with it to find out that the second most popular language for posts regarding IBM Watson is Russian. Move your mouse over this item within your pie chart to see these results.

37. Mouse over the fifth and sixth largest languages in the pie chart you just generated, and note that they are both variations on the Chinese language. In the steps that follow, you'll clean up this data so that all forms of "Chinese" will appear as a single category.

38. From the Watson Sorted workbook, click on the Edit button.

39. Optionally, click on the Fit column(s) button to make your columns thinner and to see more data on the screen.

40. Add another column to your workbook to capture the new values for various forms of "Chinese". To do this, click on the triangle next to the Language column name. Select the Insert Right -> New Column option.

41. Provide a name for your new column, such as Language_Revised, and then click the green check-mark button (or hit Enter) to apply your new column name.
42. Your cursor is then moved to the fx (or function) area, where you can provide the function to be used to generate the contents of your new column.
43. Enter the following formula as your function:
    IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)
    This formula looks at the Language column, indicated by #Language. If the #Language column starts with Chin*, then the new #Language_Revised column will contain Chinese. If it does not, the value of #Language is copied over to #Language_Revised. (See the original article, URL at the top of this document, for additional explanation of this formula.)
44. Click the green check-mark button (or hit Enter). The output of this formula will appear in your new column.

45. Click Save and Exit. You will be prompted to click Run to update the data.
46. Click the Run button in the upper right to run the workbook.
47. Now, click on the Language Coverage tab that contains your previously generated pie chart. This now has the status of needs to be run. Before you run it, you need to change one of the settings on the pie chart to use the newly generated column named Language_Revised.
48. To change the settings, click on the triangle next to the Language Coverage tab.
49. Click to select the Chart Settings option.
50. Change the Value: item to be based on the new Language_Revised column.

51. Click on the green check-mark button to apply your new settings.
52. Click on the Run button to regenerate your pie chart.
53. Once your new pie chart has been generated, you should be able to see Chinese as a cleaned-up, single item in your pie chart (compared to the two items you saw previously). With this cleansed data, Chinese is now the second largest and Russian is third.
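
The column formula and the pie chart in the steps above amount to a normalize-then-count operation. The following Java sketch mirrors that logic outside BigSheets, purely as an illustration; it is not how BigSheets evaluates formulas. The class name, method names, and sample language labels are hypothetical, and the prefix test follows the lab's description of the formula (values that start with "Chin").

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of BigSheets or BigInsights.
final class LanguageCoverage {

    // Mirrors the lab's description of the formula: values starting with
    // "Chin" become "Chinese"; all other values are copied unchanged.
    static String reviseLanguage(String language) {
        if (language != null && language.startsWith("Chin")) {
            return "Chinese";
        }
        return language;
    }

    // Counts rows per revised language, which is what the pie slices represent.
    static Map<String, Integer> countByRevisedLanguage(List<String> languages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String language : languages) {
            counts.merge(reviseLanguage(language), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the lab's data set.
        List<String> sample = Arrays.asList(
            "English", "Russian", "Chinese - Simplified",
            "Chinese - Traditional", "Russian");
        System.out.println(countByRevisedLanguage(sample));
    }
}
```

With the two Chinese variants folded into one key, their counts are added together, which is why the merged category can overtake a language that previously ranked above either variant.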

Exercise 3: Querying data with IBM Big SQL

In this exercise, you'll learn how to use IBM Big SQL, an SQL language processor, to summarize, query, and analyze data in a data warehouse system for Hadoop. Big SQL provides broad SQL support that is typical of commercial databases. You can issue queries using JDBC or ODBC drivers to access data available through IBM's MapReduce service on the cloud in the same way that you access databases from your enterprise applications.

To keep things simple and enable you to concentrate on learning Big SQL, you'll use the interactive Big SQL application available from the Web console to issue your SQL statements. Alternatively, you can use the InfoSphere BigInsights Tools for Eclipse to create and run Big SQL queries interactively from Eclipse.

Before completing this lab, you should be familiar with the Web console and BigSheets, as certain exercises in this lab use the output of work covered in previous labs.

After you complete this module, you will understand how to:
- Create a Big SQL table that uses Hive as its storage manager.
- Load data exported from BigSheets into your Big SQL table.
- Query your Big SQL table from the Web console.

Allow 30 minutes to complete this lab.

Obtaining sample data

In this section, you will use data contained in one of your BigSheets workbooks as sample data to load into a Big SQL table. If you haven't already done so, complete at least the first two sections of the previous lab on BigSheets. You will need access to the Watson Blogs Revised workbook created from that lab.

1. If necessary, launch the Web console and click on the BigSheets tab.
2. Open the Watson Blogs Revised workbook you created previously.
3. In the upper right corner, click the Export As button. When prompted, select TSV (tab-separated values) from the drop-down list of data format types.
4. Select File as the Export to: destination source.
5. Click the Browse button. In the DFS file navigator window that appears, navigate to the ../sampledata subdirectory of your user ID and enter sampleblogs as the file's name. (Do not add a .tsv suffix -- this will be done automatically.)

6. Click OK. Inspect the data shown and verify that the Include Headers button is unchecked. Click OK again.
7. Click on the Files tab of the Web console, and navigate to the /sampledata directory where you exported the file. Verify that sampleblogs.tsv is present.
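
Before loading an exported file into a table, it can help to confirm that each line splits into the number of fields you expect. The following is a minimal sketch, assuming that field values contain no embedded tab characters. The TsvLine class and its methods are hypothetical names and are not part of BigInsights.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for inspecting one line of an exported TSV file.
final class TsvLine {

    // Splits a line on tab characters. The limit of -1 keeps trailing
    // empty fields, so a row whose last column is blank still yields
    // the full number of fields.
    static List<String> fields(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    // Returns true when the line has exactly the expected number of fields.
    static boolean hasFieldCount(String line, int expected) {
        return fields(line).size() == expected;
    }

    public static void main(String[] args) {
        // Made-up row with three fields, the last one empty.
        String row = "US\tSome feed\t";
        System.out.println(fields(row).size()); // prints 3
    }
}
```

Because the export was made with Include Headers unchecked, every line of the file is a data row and can be checked the same way.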

Creating, loading, and querying a Big SQL table

Now that you have the results of your BigSheets analysis ready, you can create a Big SQL table for it, load that table with your data, and query the table's contents. For simplicity, this lab explains how to do that from the Web console. If you already have your Eclipse environment set up for working with BigInsights on BlueMix, you can issue these statements from a Big SQL file in one of your projects as well.

1. From the Welcome tab of the Web console, click on the Run Big SQL Queries link in the Quick Links section. A new tab will appear in your Web browser.

2. Determine the name of the Big SQL schema that your user ID is authorized to access. This information is part of the BigSqlUrl environment variable included in the VCAP_SERVICES list, which was discussed in the first lab module (on the Web console). In the example shown below, the JDBC URL for Big SQL is jdbc:bigsql://11.22.33.44:7052/db0jyxfcby, so the database name for this user is db0jyxfcby.

    {
      "name": "MapReduce-ei562",
      "label": "MapReduce-2.1.0",
      "plan": "Community",
      "credentials": {
        "username": "u123456",
        "password": "pw123456.1234567890",
        "BigSqlUrl": "jdbc:bigsql://11.22.33.44:7052/db0jyxfcby",
        "ConsoleUrl": "https://11.22.33.44:8080/data/html/index.html",
        "HiveUrl": "jdbc:hive://11.22.33.44:10000/db0jyxfcby",
        "HttpfsUrl": "http://11.22.33.44:14000/webhdfs/v1/"
      }
    }

   You will need to refer to this database (schema) name in your queries.
3. In the middle box of the Big SQL query application, type a CREATE TABLE statement similar to the example shown below, adjusting the schema name for the table to match your environment.

    create table if not exists schema-name.watsonblogs
      (country char(2), feedinfo varchar(300), countrylang char(25), published char(25),
       subject varchar(300), tags varchar(100), type char(20), url varchar(100))
      row format delimited fields terminated by '\t';

   The screen capture shown above depicts a version of this statement that creates a table named watsonblogs in the db0jyxfcby schema (database) if such a table doesn't already exist. The table consists of 8 columns, each corresponding to a field in the TSV file generated by the BigSheets export operation.
4. Click Run and verify that the operation completes successfully.
5. Next, enter a LOAD command similar to this example, adjusting the path specification and database schema name to match your environment:

    load hive data inpath '/user/0jyxfcby/sampledata/sampleblogs.tsv' overwrite into table db0jyxfcby.watsonblogs;

   Note that this command loads data from the sampleblogs.tsv file in the /user/0jyxfcby/sampledata subdirectory of the distributed file system into the db0jyxfcby.watsonblogs table, overwriting any data that might be present in the table. In keeping with Hive's behavior, this command moves the file from its original DFS directory into the Hive database.
6. Click Run and verify that the operation completes successfully.
7. Finally, query the table with a SELECT statement similar to this:

    select * from db0jyxfcby.watsonblogs limit 10;

   Remember to adjust the SELECT statement to reference the appropriate schema for your environment.
8. Run the command and inspect your output.
9. If desired, create additional Big SQL tables based on other BigSheets workbooks that you export and experiment with querying data in these tables.
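
The schema name used throughout these statements is the last path segment of the BigSqlUrl value. As a hedged illustration, the Java sketch below pulls the host, port, and schema out of a URL of the form jdbc:bigsql://host:port/schema and builds a schema-qualified SELECT. The BigSqlUrl class and its methods are hypothetical names introduced here, and the URL layout is assumed from the sample VCAP_SERVICES entry in this lab rather than from a driver specification.

```java
import java.net.URI;

// Hypothetical helper; not part of the Big SQL JDBC driver.
final class BigSqlUrl {
    final String host;
    final int port;
    final String schema;

    private BigSqlUrl(String host, int port, String schema) {
        this.host = host;
        this.port = port;
        this.schema = schema;
    }

    // Parses a URL of the assumed form jdbc:bigsql://host:port/schema.
    static BigSqlUrl parse(String jdbcUrl) {
        final String prefix = "jdbc:";
        if (jdbcUrl == null || !jdbcUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
        }
        // Drop the "jdbc:" prefix so java.net.URI can parse the remainder.
        URI uri = URI.create(jdbcUrl.substring(prefix.length()));
        String path = uri.getPath();
        if (uri.getHost() == null || uri.getPort() < 0
                || path == null || path.length() < 2) {
            throw new IllegalArgumentException(
                "Expected jdbc:bigsql://host:port/schema but got " + jdbcUrl);
        }
        // The path starts with "/", so skip it to get the schema name.
        return new BigSqlUrl(uri.getHost(), uri.getPort(), path.substring(1));
    }

    // Builds a SELECT like the one in this lab, qualified with the schema.
    String selectSample(String table, int limit) {
        return "select * from " + schema + "." + table + " limit " + limit;
    }

    public static void main(String[] args) {
        BigSqlUrl url = parse("jdbc:bigsql://11.22.33.44:7052/db0jyxfcby");
        System.out.println(url.schema); // prints db0jyxfcby
        System.out.println(url.selectSample("watsonblogs", 10));
    }
}
```

The host and port fields are the same values you would type into a connection dialog, so one parse of the VCAP_SERVICES entry covers both uses.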

Optional Exercise: Setting up an Eclipse environment

IBM provides Eclipse tooling to simplify development of applications that use BigInsights services. This optional exercise takes you through the basics of configuring an appropriate Eclipse environment to work with some of the BlueMix MapReduce services available to you.

In this exercise, you will learn how to:
- Download IBM Eclipse tooling for BigInsights
- Configure a BigInsights server connection
- Configure a Big SQL connection
- Create and test a Big SQL JDBC client application

Allow 15-30 minutes to complete this exercise (not including software download time).

Getting started

10. From the Welcome page of the Web console, click on the Quick Link for information about enabling your Eclipse development environment.
11. Review the information displayed.

12. If necessary, download the appropriate Eclipse shell from www.eclipse.org.
13. Click on the link provided to review the detailed information in the BigInsights Information Center on this topic.
14. Launch Eclipse, and follow the standard process for installing new software. (For example, click Help > Install new software.)
15. After you've installed the BigInsights plug-in, verify that your installation was successful. Open Eclipse. The Task Launcher for Big Data should appear.

16. If the Task Launcher does not appear, you may need to open the BigInsights perspective manually. From the Eclipse menu items at top, select Window > Perspective > BigInsights. (If necessary, click Window > Perspective > Other > BigInsights.)

Creating a BigInsights Server Connection

Issuing interactive Big SQL statements requires a live connection to the IBM MapReduce service on BlueMix (i.e., a BigInsights server connection). This section describes how you can define a BigInsights server connection in Eclipse.

1. From the Overview tab of the Task Launcher for Big Data, click Create a BigInsights server connection.
2. Enter the appropriate information in the pop-up window, including the URL to access your BigInsights Web console, a server name of your choice, a valid BigInsights user ID, and a password. (The information shown below contains sample information -- the data you enter must match the VCAP_SERVICES environment variable values for your BlueMix environment.)

3. Click the Test connection button and verify that you can successfully connect to your target cluster.
4. Click the Save password box and Finish.
5. In the BigInsights Server pane, expand the list of servers and verify that the server connection you created appears.

Creating a Big SQL Connection in Eclipse

Certain tasks require a live connection to a Big SQL server within the BigInsights cluster. This section explains how you can define a JDBC connection to your Big SQL server.

6. Open the Database Development perspective (Window > Open Perspective > Other > Database Development).
7. In the Data Source Explorer pane, right-click on Database Connections > Add Repository.

8. In the New Connection Profile menu, select Big SQL JDBC Driver and enter a name for your new driver (e.g., My Big SQL Connection). Click Next.
9. Enter the appropriate connection information for your environment, including the host name, port number, user ID, and password. Verify that you have selected the correct JDBC driver at top. (The information shown below contains sample information -- the data you enter must match the VCAP_SERVICES environment variable values for your BlueMix environment.)

10. Click the Test connection button and verify that you can successfully connect to your target Big SQL server.
11. Click the Save password box and Finish.
12. In the Data Source Explorer, expand the list of data sources and verify that your Big SQL connection appears.

Creating and testing a Big SQL JDBC client application

IBM's MapReduce service enables you to write a JDBC client application to access Big SQL data much as you would access data in any relational database. You don't even need to download and install the BigInsights Eclipse plug-in to do so; you just need the Big SQL JDBC client package, accessible from the Web console's Welcome page.

In this exercise, you'll create a simple Java project with a client application that opens a Big SQL database connection, executes a simple query, and displays the results.

1. From the Welcome page of the Web console, click on the Quick Link for downloading the Big SQL Client drivers. If you download this .zip file, you'll note that it contains a JDBC .jar file. (The JDBC client driver is also included in the BigInsights plug-in you downloaded in the previous sections.)
2. In Eclipse, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next.
3. Type a name of your choice for the project in the Project Name field. Click Next.
4. Open the Libraries tab and click Add External Jars. Provide a path to the Big SQL JDBC driver (bigsql-jdbc-driver.jar).

5. Click Finish. (If you're prompted to open a different perspective, click No.)
6. Right-click on your Java project, and click New > Package. Enter a name for your package when prompted, and click Finish.
7. Right-click your package, and click New > Class.
8. In the New Java Class window, enter a name for your class. Select the public static void main(String[] args) check box. Click Finish.

9. Copy or type the following code into your .java file. Note that you will need to adjust some variable settings shown here to match your environment.

    // a. Declare package & class names; import required package(s)
    package test;

    import java.sql.*;

    public class Sample {
        // b. Set JDBC & database info; customize these for your env.
        static final String db = "jdbc:bigsql://74.111.222.33:7052/yourschema";
        static final String user = "yourid";
        static final String pwd = "yourpassword";

        /**
         * @param args
         */
        public static void main(String[] args) {
            // TODO Auto-generated method stub
            Connection conn = null;
            Statement stmt = null;
            System.out.println("Started sample JDBC application.");
            try {
                // c. Register JDBC driver
                Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");

                // d. Get a connection
                conn = DriverManager.getConnection(db, user, pwd);
                System.out.println("Connected to the database.");

                // e. Execute a query
                // Change the schema name in the SELECT statement
                stmt = conn.createStatement();
                System.out.println("Created a statement.");
                String sql;
                sql = "select countrylang, subject, url from yourschema.watsonblogs limit 10";
                ResultSet rs = stmt.executeQuery(sql);
                System.out.println("Executed a query.");

                // f. Obtain results
                System.out.println("Result set: ");
                while (rs.next()) {
                    // Retrieve by column name
                    String lang = rs.getString("countrylang");
                    String subject = rs.getString("subject");
                    String url = rs.getString("url");
                    // Display values
                    System.out.print("* Language: " + lang + "\n");
                    System.out.print("* Subject: " + subject + "\n");
                    System.out.print("* Url: " + url + "\n\n");
                }

                // g. Close open resources
                rs.close();
                stmt.close();
                conn.close();
            } catch (SQLException sqle) {
                // Process SQL errors
                sqle.printStackTrace();
            } catch (Exception e) {
                // Process other errors
                e.printStackTrace();
            } finally {
                // Ensure resources are closed before exiting
                try {
                    if (stmt != null) stmt.close();
                } catch (SQLException sqle2) {
                } // nothing we can do
                try {
                    if (conn != null) conn.close();
                } catch (SQLException sqle) {
                    sqle.printStackTrace();
                } // end finally block
            } // end try block
        }
    }

a. Adjust the package and class names shown, if needed, to match the names you selected earlier.
b. Modify the database connectivity variables as needed to match your environment.
c. Register the JDBC driver so that you can open a communications channel with the database. The correct JDBC driver is shown.
d. Open the connection.
e. Run a query. Note that you will need to modify the SQL statement shown to reference the correct database schema name for your table.
f. Extract data from the result set.
g. Clean up the environment by closing all of the database resources.
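
Step g closes resources by hand, with a finally block as a safety net. On Java 7 or later, the try-with-resources statement is an alternative, because Connection, Statement, and ResultSet all implement AutoCloseable. The sketch below is not the lab's code and does not touch a database. It uses a hypothetical Tracked resource to show that resources declared in the try header are closed automatically, in reverse order of declaration, even when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; Tracked is a hypothetical stand-in for a JDBC resource.
final class CloseOrderDemo {

    // Records its name in a shared log when it is closed.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    // Opens three resources in the same order as the lab's client
    // (connection, statement, result set) and returns the close order.
    static List<String> run(boolean failInBody) {
        List<String> closed = new ArrayList<>();
        try (Tracked conn = new Tracked("conn", closed);
             Tracked stmt = new Tracked("stmt", closed);
             Tracked rs = new Tracked("rs", closed)) {
            if (failInBody) {
                throw new IllegalStateException("simulated query failure");
            }
        } catch (IllegalStateException e) {
            // The resources are already closed by the time this handler runs.
        }
        return closed;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [rs, stmt, conn]
        System.out.println(run(true));  // prints [rs, stmt, conn]
    }
}
```

Applied to the client above, the same pattern would declare the connection, statement, and result set in the try header, which removes the need for the explicit close calls and the nested try blocks in finally.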

Summary

Congratulations! You've just learned how to get started using the beta version of IBM MapReduce services on the BlueMix cloud. Behind the scenes, this service uses InfoSphere BigInsights, IBM's Hadoop-based platform, to execute jobs on your behalf. Feel free to visit the public wiki to learn more about IBM's Hadoop-based platform through articles, videos, online courses, etc. And be sure to post any questions you may have about IBM MapReduce services or BigInsights to the public forum.

The authors would like to thank Louis Mau, Jayatheerthan Krishnamurthy, and Ellen Patterson for their assistance.

Important Information: References in this lab to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. These materials are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this document is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

Copyright IBM Corporation 2014. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.