Actian Vortex Express 3.0 Quick Start Guide AH-3-QS-09
This Documentation is for the end user's informational purposes only and may be subject to change or withdrawal by Actian Corporation ("Actian") at any time. This Documentation is the proprietary information of Actian and is protected by the copyright laws of the United States and international treaties. It is not distributed under a GPL license. You may make printed or electronic copies of this Documentation provided that such copies are for your own internal use and all Actian copyright notices and legends are affixed to each reproduced copy. You may publish or distribute this document, in whole or in part, so long as the document remains unchanged and is disseminated with the applicable Actian software. Any such publication or distribution must be in the same manner and medium as that used by Actian, e.g., electronic download via website with the software or on a CD-ROM. Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Actian. To the extent permitted by applicable law, ACTIAN PROVIDES THIS DOCUMENTATION "AS IS" WITHOUT WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT. IN NO EVENT WILL ACTIAN BE LIABLE TO THE END USER OR ANY THIRD PARTY FOR ANY LOSS OR DAMAGE, DIRECT OR INDIRECT, FROM THE USER OF THIS DOCUMENTATION, INCLUDING WITHOUT LIMITATION, LOST PROFITS, BUSINESS INTERRUPTION, GOODWILL, OR LOST DATA, EVEN IF ACTIAN IS EXPRESSLY ADVISED OF SUCH LOSS OR DAMAGE. The manufacturer of this Documentation is Actian Corporation. For government users, the Documentation is delivered with "Restricted Rights" as set forth in 48 C.F.R. Section 12.212, 48 C.F.R. Sections 52.227-19(c)(1) and (2) or DFARS Section 252.227-7013 or applicable successor provisions. Copyright 2015 Actian Corporation. All Rights Reserved.
Actian, Actian DataFlow, Actian Director, Actian Vector, Actian Vector Express, Actian Vector ExpressPlus, Actian Vector in Hadoop, Actian Vortex Express, Actian Vortex ExpressPlus, Action Server, Cloud Action Platform, Cloud Action Server, EDBC, Enterprise Access, Ingres, OpenROAD, and Vectorwise are trademarks or registered trademarks of Actian Corporation. All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies.
Contents

Quick Start Demo
    Introduction
    Use Case
    Prerequisites
    Remove KNIME Public Server (MapR Installations Only)
    Run the Demo
    Run the Workflow on a Cluster as a YARN Job
    Summary
    Next Steps
Quick Start Demo

Introduction

This Quick Start Guide shows you how to run a simple, pre-built, end-to-end workflow that demonstrates the basic principles, components, and power of the Actian Analytics Platform. The demonstration workflow has been pre-configured to run immediately after installation with minimal additional configuration.

Use Case

This simple demonstration features the ability of the Actian Analytics Platform to connect to more than one data source, join the sources together, create a new field using the expression builder, persist the results into Actian Vector, and interact with those results through Actian Director.

In this case we are exploring customer churn across two telecommunications data sets: one shows customer demographics and call log history, and the other contains geospatial (area code) information. The data allows the telecommunications provider to identify which customers have churned (changed carriers) and explore the characteristics and geographic regions where churn was the highest.

Prerequisites

Actian Vortex Express 3.0 must be installed to run the demo.

Remove KNIME Public Server (MapR Installations Only)

To use DataFlow extensions to KNIME with MapR, you must remove the KNIME Public Server Access software. For instructions, see Integrating DataFlow with MapR (http://help.pervasive.com/display/df651/integrating+dataflow+with+mapr).

Run the Demo

To run the demo, you must connect as user actian to the Linux master node on which you installed Vortex Express.
Start DataFlow (KNIME)

1. As user "actian", start Actian DataFlow (KNIME) on the Linux master node on which you installed Vortex Express. The password for user actian is the one you chose during installation. Start DataFlow by running the following command:

   knime

2. If prompted for a workspace location, accept the default option. The Welcome to KNIME dialog is displayed.
3. Select Open KNIME workbench (if this is the first time you have started KNIME). The workbench is loaded.

Run the workflow

1. Open the pre-built Churn_Quick_Start workflow: Expand the LOCAL workspace located in the KNIME Explorer and double-click the Churn_Quick_Start workflow. KNIME loads the workflow.
2. Click the Execute All button on the toolbar. The Actian DataFlow workflow executes the following steps:
   a. In parallel, reads 10,000 rows of customer relationship management (CRM) data from a CSV file and combines them with 1,000,000 rows of Customer Geospatial data contained in a separate CSV file.
   b. Joins the two sources of information based on the common customer ID field.
   c. Derives a new field called Total Call Minutes based on the existing source fields, "calls" and "mins".
   d. Stores the resulting information in an Actian Vector in Hadoop table called tbl_churn_quick_start.

   Note: To rerun the workflow, you need to reset the nodes in the workflow.
   To reset a node: Right-click the node and select Reset.
   To reset all the nodes: Right-click the last node in the workflow (Load Actian Vector On Hadoop) and select Reset. The node status indicator turns from green to amber to indicate that it has been reset.
   To run the workflow again: Click Execute All.

3. Use Actian Director to view the resulting data stored in the Actian Vector table tbl_churn_quick_start. Actian Director lets you do a variety of tasks, such as creating and administering databases, running queries, and configuring security.
   a. Start Actian Director. You should be connected as user actian. On the command line, enter director. Director is started. The Actian Vector in Hadoop AH instance is displayed in the Instance Explorer.
   b. Connect to the Actian Vector in Hadoop AH instance by right-clicking the instance and selecting Connect. The Connect to Instance dialog is displayed.
   c. Enter the following credentials, and then click Connect:
      Authentication: Authenticated User
      Login: demo
      Password: hsedemo
   d. In the Instance Explorer, expand the Actian Vector in Hadoop AH instance and select Databases, sample, Tables, demo.tbl_churn_quick_start.
   e. Right-click the demo.tbl_churn_quick_start table and select Select First 1000 Rows. The query and its results are displayed. Note the last three columns:
      churn: Identifies whether this customer has churned (changed carriers).
      areacode: This field originated from the geospatial data we connected to at the beginning of the workflow.
      total_mins: This is the new field derived automatically in the workflow.

Run the Workflow on a Cluster as a YARN Job

Adapting your workflow to your business needs can make the workflow more complicated and use larger amounts of data. DataFlow lets you execute your workflow as a distributed YARN job over your HDFS cluster so that it can scale out to use resources across the cluster. This is achieved through the DataFlow Cluster Manager, which is installed and set up as part of the Vortex Express installation.

The process for enabling the workflow to execute over a cluster is as follows:
1. Create an execution profile.
2. Configure the workflow to use the profile.
3. Run the workflow.

To create an execution profile

1. In KNIME, click File, Preferences, Actian. Existing profiles are shown in the Profiles section. You will create a cluster profile.
2. Click Add and enter "cluster" as the name of the new profile. A new profile called "cluster" is created and is automatically selected.
3. Click the value for Execute in cluster and change it from false to true.
4. Click the Cluster URL value and change it to yarn://<host-name-of-the-clustermanager>:47000. The Cluster Manager runs on the same node as the HDFS NameNode. For MapR, the Cluster Manager runs on the Vector in Hadoop master node. To get the name of the NameNode for non-MapR distributions, run the command hdfs getconf -confKey fs.defaultFS, which returns a string in the format hdfs://<hdfs-namenode>:8020.
5. Click OK to accept the changes. The profile is configured to point to the DataFlow Cluster Manager.

To configure the workflow to use the cluster execution profile

1. Right-click the Churn_Quick_Start_Vortex workflow in KNIME Explorer and select Configure. The Configure dialog is displayed.
2. Select the Job Manager Selection tab. In the Profile drop-down, choose the cluster profile that you have just created. The workflow is now configured to execute in a cluster.

To run the workflow

1. Click Execute All on the toolbar (or select Node, Execute All). The workflow is executed across the cluster.

   Note: Executing the workflow over YARN requires YARN containers to be created. Creating these containers can take time, so the workflow may seem to execute more slowly than it did earlier. This startup time is usually a fixed duration and is not tied to the size of the data processed in the workflow.

The DataFlow Cluster Manager also offers a Web UI console that you can log in to and view the YARN job execution details.

To look at YARN job execution details

1. Point your browser to the URL http://<hdfs-namenode>:47100 and log in as user "root" with password "changeit".

   Note: On MapR, the URL is http://<vector-in-hadoop-master-node>:47100. The Vector in Hadoop master node is the node where you ran the Vortex Express installer.

2. Click Recent Jobs under the cluster monitoring section.
   The Churn_Quick_Start_Vortex application that you just executed is listed.
3. Click the Churn_Quick_Start_Vortex application. Details are displayed, such as how many YARN containers were used to run the job and which hosts the containers ran on.
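For non-MapR distributions, the Cluster URL entered in the execution profile and the Web UI console URL both use the NameNode host reported by fs.defaultFS. The following is a minimal Python sketch of that derivation, not part of the product: the function name cluster_manager_urls and the host name namenode01.example.com are hypothetical, and the ports 47000 and 47100 are simply the ones quoted in this guide. On MapR the Cluster Manager runs on the Vector in Hadoop master node instead, so this derivation does not apply there.

```python
from urllib.parse import urlsplit

# Ports as quoted in this guide; adjust if your installation differs.
CLUSTER_MANAGER_PORT = 47000  # used in the profile's Cluster URL
WEB_UI_PORT = 47100           # used by the Cluster Manager Web UI console


def cluster_manager_urls(default_fs: str) -> dict:
    """Build the Cluster URL and Web UI URL from an fs.defaultFS value.

    default_fs is the string printed by the fs.defaultFS lookup,
    for example "hdfs://<hdfs-namenode>:8020".
    Hypothetical helper for illustration only.
    """
    # Extract only the host part; the HDFS port (8020) is not reused.
    host = urlsplit(default_fs.strip()).hostname
    if not host:
        raise ValueError(f"no host name found in {default_fs!r}")
    return {
        "cluster_url": f"yarn://{host}:{CLUSTER_MANAGER_PORT}",
        "web_ui": f"http://{host}:{WEB_UI_PORT}",
    }


if __name__ == "__main__":
    # "namenode01.example.com" is a made-up host name.
    urls = cluster_manager_urls("hdfs://namenode01.example.com:8020\n")
    print(urls["cluster_url"])
    print(urls["web_ui"])
```

The helper takes the string as an argument rather than running the hdfs command itself, so it can be tried on any machine; paste in the value returned on your cluster to get the two URLs.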
Summary

In this Quick Start Guide, you were introduced to Actian's KNIME integration, which featured a pre-built workflow that let you explore a simple telecommunications churn dataset. The executed workflow ran using Actian's embedded DataFlow engine, which transparently distributed the data and processing across multiple cores on the nodes of your Hadoop cluster and loaded the transformed data in parallel into the Vector in Hadoop database.

Actian Vector in Hadoop is a distributed columnar database that leverages the Hadoop Distributed File System (HDFS) and MapR file system (MapR-FS) for storage. It uses a proven, ANSI-standard-compliant SQL engine that performs native SQL processing of data in the distributed file system and can be used for efficient large-scale data warehousing, data mining, and reporting. It has rich SQL language support, an advanced query optimizer, and support for trickle updates, and it has been certified for use with the most popular BI tools. Vector guides can be accessed at esd.actian.com (http://esd.actian.com) or docs.actian.com (http://docs.actian.com/).

Using KNIME is just one of the ways in which you can leverage DataFlow technology. DataFlow comes with rich Java and RushScript (based on JavaScript) APIs that offer you more programmatic control over your workflows. If you are interested in the DataFlow API, see DataFlow API Usage (http://help.pervasive.com/display/df651/dataflow+api+usage) for Java and Using RushScript (http://help.pervasive.com/display/df651/using+rushscript) for RushScript. General help and an overview of DataFlow can be found at Actian DataFlow 6.5.1 Help (http://help.pervasive.com/display/df651/actian+dataflow+6.5.1+help).
Using DataFlow (through KNIME, Java, or RushScript), you can pull in data from a variety of sources (flat files, various database systems, S3, HDFS, and more), process it according to your requirements, and optionally store the results in Vector in Hadoop for fast analytical reporting. DataFlow will transparently scale up and scale out (when using a cluster), which lets you stay focused on tweaking your analytical workflows and algorithms.

Although this is a simple example with limited data and a simplified workflow for demonstration purposes, it represents a model for how customers are using the Actian Analytics Platform to solve real-world big data business problems. For more comprehensive workflow examples and real-world solution blueprints, visit the Actian Clear Path Program (http://www.actian.com/solutions/customer-analytics/).

Next Steps

As a next step, we recommend reading the Tutorial guide, which provides lessons on how to create a workflow from scratch using the drag-and-drop interface and how to deploy it. It also includes lessons on how to connect to Actian Vector using JDBC. To proceed, access the Tutorial at http://esd.actian.com/express/vortex/3.0/tutorial_vortex.pdf.