HPE Vertica Analytic Database
Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice. Restricted Rights Legend Confidential computer software. Valid license from Hewlett Packard Enterprise required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. Copyright Notice Copyright 2015 Hewlett Packard Enterprise Development L.P. Trademark Notices Adobe is a trademark of Adobe Systems Incorporated. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered trademark of The Open Group. This product includes an interface of the 'zlib' general purpose compression library, which is Copyright 1995-2002 Jean-loup Gailly and Mark Adler. Copyright 2015 Hewlett Packard Enterprise Development LP. Page 2
Table of Contents About Vertica Connection Guides... 4 Tungsten Replicator Overview... 4 Replication to Vertica... 4 Download and Install Tungsten... 5 Download and Install the Vertica Client Drivers... 5 Download Vertica Client Drivers... 6 Install Vertica Client Drivers... 6 Set up MySQL to Vertica Replication... 6 Configure Hosts... 6 Prepare Schemas in Vertica... 6 Install Vertica Replication... 7 Monitor Vertica Deployment... 9 Test MySQL with Vertica Replication... 11 Example: Use the SimpleBatchApplier... 12 Known Limitations... 14 Data Type Mapping: MySQL to Vertica... 14 For More Information...15 Copyright 2015 Hewlett Packard Enterprise Development LP. Page 3
About Vertica Connection Guides HPE Vertica connection guides provide basic information about setting up connections to Vertica from software that our technology partners create. These documents provide guidance using one specific version of Vertica and one specific version of the third party vendor s software. Other versions of the third-party product may work with Vertica. However, Hewlett Packard Enterprise may not have tested these other versions. This document provides guidance using the latest versions of Vertica and Tungsten as of. Tungsten Replicator Overview The Tungsten Replicator is an open source replication engine that supports many different extractor and applier modules. Data can be extracted from MySQL and Oracle, and applied to transactional stores. Data can also be applied to NoSQL stores such as MongoDB, and data warehouse stores such as Vertica. The core of the replication functionality lies in three major components: Transaction History Log (THL): The THL is the data that Tungsten takes from the master's binary logs and transports to its servers, with the addition of some metadata. Extractor: The extractor component reads data from the source data server and writes that information to the Transaction History Log (THL). The extractor is also responsible for writing the data into the THL in the native or derived format either as a SQL statement or as row-based information. Applier: Appliers within the Tungsten Replicator convert the THL information and write it to a destination data server. The applier has the capability to work with a number of different target databases, such as Vertica or MySQL. This document demonstrates how users can replicate data from MySQL to Vertica using Tungsten. Tungsten uses the JDBC driver to connect to Vertica. This document provides the replication mechanism from MySQL 5.6 to Vertica 7.2.x using Tungsten 4.0. Hewlett Packard Enterprise has tested Vertica 7.1.x and 7.2.x with Tungsten 4.0.0.18 on a Linux platform. Replication to Vertica In heterogeneous systems, it is not possible to execute the DDL and DML statements from source systems into target systems because the SQL dialects are different. Tungsten uses row-based replication when performing replication from MySQL to Vertica. Replication to Vertica follows this flow: Copyright 2015 Hewlett Packard Enterprise Development LP. Page 4
Data is extracted from the source database into the THL. While extracting data from the THL, the Tungsten Replicator writes the data into CSV files based on the names of the source tables. These files contain all of the row-based data, including the unique global transaction ID generated by the Tungsten Replicator during replication. The operation type (Insert or Delete) is also listed as part of the CSV file. Vertica loads the CSV files into staging tables. Tungsten executes SQL statements to perform updates on live versions of the tables. The statements use the CSV, batch-loaded information to delete old rows and insert new data into tables. The statements also perform updates as necessary to work effectively with Vertica. Download and Install Tungsten Follow these steps to download and install the Tungsten Replicator. 1. Download the software for Tungsten 4.0. 2. Unpack the file into the Tungsten Replicator deployment directory using the following command: Tungsten: shell > $>tar xvf tungsten-replicator-4.0.0-18.tar.gz Download and Install the Vertica Client Drivers Before you can connect to Vertica using Tungsten, you must install the Vertica client package. This package includes the JDBC client driver that Tungsten uses to connect to Vertica. Copyright 2015 Hewlett Packard Enterprise Development LP. Page 5
Download Vertica Client Drivers 1. Go to the Vertica Client Drivers page. 2. Download the version of the Vertica client package that is compatible with the architecture of your operating system and Vertica server version. Note: Vertica drivers are forward compatible, so you can connect to the Vertica server using previous versions of the client. For more information about client and server compatibility, see Client Driver and Server Version Compatibility in the Vertica documentation. Install Vertica Client Drivers Based on the client package you downloaded, follow the steps for installation from the Vertica documentation. Copy the Vertica JDBC driver into the Tungsten Replicator deployment directory: <Tungsten_Home>/tungsten-replicator-4.0.0-18/tungstenreplicator/lib Set up MySQL to Vertica Replication Before you begin, you must configure the staging host that is responsible for setting up the replication services involved in the replication process. For details on configuring the staging host, see Staging Host Configuration in the Tungsten documentation. Configure Hosts MySQL uses row-based replication to replicate data to Vertica. You must create tables that need to be replicated in Vertica; tables are not automatically created. You must use the ddlscan utility to create tables for replication. To prepare the MySQL and Vertica hosts, follow the steps in Preparing Hosts for Vertica Deployments. Prepare Schemas in Vertica The Tungsten Replicator does not prepare target database schemas and tables based on the source database. The Tungsten Replicator includes a tool called the ddlscan, which reads the schema definition from MySQL and translates that information into the schema definition required for the target database. You should prepare the source database tables you want to replicate to Vertica and then prepare table definitions for your Vertica database. Follow these steps to create schema and tables for the staging tables and target tables: 1. Create a table in MySQL using the following command: Copyright 2015 Hewlett Packard Enterprise Development LP. Page 6
mysql -utungsten DBNAME... mysql> CREATE TABLE Char1_Table(Id int primary key, 2. Execute the ddlscan tool twice. DataTypeSet VARCHAR (20) 'Char1', ValueDesc VARCHAR (50), Char1_Column CHAR ); default charset=utf8; Shell $> cd <Tungsten_Home>/tungsten-replicator-4.0.0-18/tungsten-replicator/samples/extensions/velocity The first time generates the live table definitions: Tungsten: shell $> /<Tungsten_Home>//tungsten-replicator- 4.0.0-18/tungsten-replicator/bin/ddlscan -user tungsten url jdbc:mysql:thin://host1:3306/dbname -pass password -template ddl-mysql-vertica.vm -db DBNAME >>ddl.sql The second time creates the table definitions for the staging data using the staging template: Tungsten: shell $> /<Tungsten_Home>//tungsten-replicator- 4.0.0-18/tungsten-replicator/bin/ddlscan -user tungsten url jdbc:mysql:thin://host1:3306/dbname -pass password -template ddl-mysql-vertica-staging.vm -db DBNAME >>ddl.sql 3. Edit ddl.sql file and update this file according to the target database. 4. Execute the Vertica database using the following command: dbadmin: shell $> vsql U dbadmin w <PASSWORD> <ddl.sql Install Vertica Replication 1. Change to the staging directory using the following command: Tungsten: shell $> cd <Tungsten_Home>/ tungsten-replicator- 4.0.0-18 2. Configure the main parameters for the replicator service: Tungsten: shell $>./tools/tpm configure alpha \ --master=host1 \ --members=host1, host2 \ --install-directory=/opt/continuent \ --disable-relay-logs=true \ --skip-validation-check=hostsfilecheck \ Copyright 2015 Hewlett Packard Enterprise Development LP. Page 7
--enable-heterogenous-service=true \ --start 3. Configure and install the MySQL Master: Tungsten: shell $>./tools/tpm update alpha\ --master=host1 \ --hosts=host1 \ --datasource-host=host1 \ --datasource-user=tungsten \ --datasource-password=password \ --datasource-mysql-conf=/usr/my.cnf \ --home-directory=/opt/continuent \ --java-file-encoding=utf8 \ --java-user-timezone=gmt \ --svc-extractor-filters=colnames,pkey \ -- property=replicator.filter.pkey.addcolumnstodeletes=true \ -- property=replicator.filter.pkey.addpkeytoinserts=true \ --mysql-use-bytes-for-string=false \ --start-and-report Note: The preceding command has some essential settings that help with heterogeneous replication: The Java VM file encoding and time zone are UTF-8 and GMT, respectively. Standardizing these values is required to avoid corrupting data in batch loads. Tungsten translates string values to UTF-8 rather than passing these values to slaves as bytes. Tungsten inserts filters to add column names and identify the primary key on tables. These additions are required for batch loading to work properly. 4. Configure and install the Vertica slave server: Tungsten: shell $>./tools/tpm update alpha \ --hosts=host2 \ --replication-user=dbadmin \ --replication-password=password \ --batch-enabled=true \ Copyright 2015 Hewlett Packard Enterprise Development LP. Page 8
--batch-load-language=js \ --batch-load-template=vertica6 \ --datasource-type=vertica \ --vertica-dbname=dbname \ --replication-host=host2 \ --replication-port=5433 \ --skip-validation-check=installermasterslavecheck \ --svc-applier-block-commit-size=25000 \ --svc-applier-block-commit-interval=30s \ --start-and-report Monitor Vertica Deployment Monitoring a Vertica replication scenario requires checking the status of both the master, which extracts data from MySQL, and the slave, which retrieves the remote THL information and applies it to Vertica. Copyright 2015 Hewlett Packard Enterprise Development LP. Page 9
The following graphic shows the master server. The output of the trepctl shows the current sequence number and applier status. The following graphic shows the slave server. The output of the trepctl shows the current sequence number and applier status. Copyright 2015 Hewlett Packard Enterprise Development LP. Page 10
Test MySQL with Vertica Replication To replicate data, you need a table on MySQL to hold some data. Follow these steps to test MySQL with Vertica using the Tungsten Replicator. The following example shows how to move a row from one table to another: 1. Log in to MySQL and insert a row: mysql -utungsten DBNAME... mysql> INSERT INTO Char1_Table VALUES(1,default, 'Empty', ''); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO Char1_Table VALUES(2,default, 'Typical', 'a'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO Char1_Table VALUES(3,default, 'Max', 'Z'); Copyright 2015 Hewlett Packard Enterprise Development LP. Page 11
Query OK, 1 row affected (0.00 sec) 2. If you configured things properly, you should see the following on the Vertica side: dbadmin=> SELECT * from tungsten.char1_table; Id DataTypeSet ValueDesc Char1_Column ----+------------------------------------------- 1 default Empty 2 default Typical a 3 default Max Z (3 rows) Example: Use the SimpleBatchApplier The Tungsten Replicator applies data to Vertica using a new applier class called the SimpleBatchApplier. It implements the CSV files through the following process: 1. As new transactions arrive, the Replicator writes them to CSV files named after the corresponding Vertica tables. For example, if you have updates for a table named simple_tab in a schema test, the format would look like the following: Schema Table Column Type Size --------+-------------+-----------------+--------------+----- tungsten Char1_Table id int 8 tungsten Char1_Table DataTypeSet varchar(20) 20 tungsten Char1_Table ValueDesc varchar(40) 40 tungsten Char1_Table Char1_Column char 1 2. The updates go into a file named test.simple_tab. The following is an example of the data in the CSV file: "64087","I","5","Some data","some data","b","1" "64087","I","6","more data","more data","c","2" "64088","D",3",","default","Max", z,"3" 3. The CSV file includes a unique global transaction ID, an operation code (I for insert and D for delete), and the primary key. For inserts, there are additional columns that contain data; for deletes, columns contain nulls. The last column is a row number, which allows for ordering after the data is loaded into Vertica. Copyright 2015 Hewlett Packard Enterprise Development LP. Page 12
4. The Tungsten Replicator applies transactions to replicas in serial order without deviations. If you INSERT and then UPDATE a row, it always works because the Replicator applies them to the slave server in the same order. 5. The Tungsten Replicator continues writing transactions until it reaches the block commit maximum. The Replicator then closes each CSV file and loads the content into a staging table that is named according to the base name. The staging table format mimics the CSV file columns. For example, the staging table could look like the following example: Schema Table Column Type Size --------+------------------+-----------------+--------------+- -----+ tungsten stage_xxx_char1_table tungsten_seqno int 8 tungsten stage_xxx_char1_table tungsten_opcode char(1) 1 tungsten stage_xxx_char1_table id int 8 tungsten stage_xxx_char1_table DataTypeSet varchar(20) 20 tungsten stage_xxx_char1_table ValueDesc Varchar(40) 40 tungsten stage_xxx_char1_table Char1_Column Char 1 tungsten stage_xxx_char1_table tungsten_row_id int 8 6. Finally, the Replicator applies the deletes and inserts to the table test.simple_tab by executing SQL commands as shown in the following example: DELETE FROM tungsten.char1_table WHERE id IN (SELECT id FROM tungsten.stage_xxx_char1_table WHERE tungsten_opcode = 'D'); INSERT INTO tungsten.char1_table(id, DataTypeSet,ValueDesc,Char1_Column) SELECT id, DataTypeSet,ValueDesc,Char1_Column FROM tungsten.stage_xxx_char1_table AS stage_a WHERE tungsten_opcode='i' AND tungsten_row_id IN (SELECT MAX(tungsten_row_id) FROM tungsten.stage_xxx_char1_table GROUP BY id); Copyright 2015 Hewlett Packard Enterprise Development LP. Page 13
Known Limitations Tungsten currently has some important limitations for batch loading, namely: Primary keys must be a single column only. Tungsten does not handle multi-column keys. You must define the primary keys. Binary data may cause problems when converted to CSV as it converts to Unicode. Data Type Mapping: MySQL to Vertica The following table shows data type mapping between MySQL data types and Vertica data types. MySQL Data Type Vertica Data Type DATETIME DATETIME TIMESTAMP TIMESTAMP DATE DATE TIME TIME TINYINT TINYINT SMALLINT SMALLINT MEDIUMINT INT INT INT BIGINT INT VARCHAR VARCHAR CHAR CHAR BINARY BINARY VARBINARY VARBINARY TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT VARCHAR(65000) BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB VARBINARY(65000) FLOAT FLOAT Copyright 2015 Hewlett Packard Enterprise Development LP. Page 14
DOUBLE ENUM SET BIT(1) BIT DOUBLE PRECISION VARCHAR VARCHAR(4000) BOOLEAN CHAR(64) For More Information For More Information About Tungsten Vertica Community Edition Vertica Documentation Big Data and Analytics Community See http://pubs.vmware.com/continuent/t ungsten-replicator-4.0/index.html https://my.vertica.com/community/ http://my.vertica.com/docs/7.2.x/htm L/index.htm https://community.dev.hpe.com/t5/big -Data-and-Analytics/ctp/bigdata_analytics Copyright 2015 Hewlett Packard Enterprise Development LP. Page 15