Creating a Data Processor Transformation for an Unstructured Data Source


© 1993-2016 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

Use a Data Processor transformation to map data from an unstructured data source to an XML target. The Data Processor transformation contains a Script that identifies the source file, the target file, and the mappings between elements. The Script uses a parser that maps unstructured data input to XML output. This article describes how to create a mapping with a Data Processor transformation that transforms unstructured data.

Supported Versions

Data Transformation 9.5.1
PowerCenter Express 9.5.1

Table of Contents

Transform an Unstructured Data Source Overview
   Scenario
   XML Output Schema
Creating and Running Mapping to Transform Unstructured Data to XML
   Step 1. Create a Script with a Parser in the Data Processor Transformation
   Step 2. Configure the Parser
   Step 3. Create and Run the Mapping
   Import the Full Mapping

Transform an Unstructured Data Source Overview

A mapping uses a Data Processor transformation to transform documents from one format to another. A parser is a Data Processor transformation object that transforms an unstructured data input source to hierarchical XML. A Data Processor transformation uses an output schema to define the expected hierarchy of the output XML.

The parser uses anchors and data holders. Anchors identify data in the input text file. Data holders identify data in the output XML file.

You use marker anchors and content anchors to identify data in parser input files. With a marker anchor, you define a character or field that marks the location of a data value. With a content anchor, you define the field that contains the value.

Scenario

The Accounts Payable department of the Hudson Furniture company receives bills in PDF format. To process bills in their payment system, they need the bill details in XML format, because the company billing system stores billing data as XML. They need to create a mapping that transforms PDF invoices into billing data.

The mapping needs to use a Data Processor transformation that takes bill details such as billing date, items ordered, item cost, and bill total as input, and outputs the details in a usable XML format.

[Figure: the mapping in this example]

The mapping contains the following objects:

Read_PDF_PATH
   The source that contains the path to the file with billing data. Reads billing data from a PDF file.
Billing_DP
   A Data Processor transformation that transforms unstructured data into an XML output hierarchy.
Write_XML_PATH
   A target path to the file that stores the transformed data every time you run the mapping.

The mapping uses the Read_PDF_PATH flat file to input the path to the PDF billing files. The mapping processes and transforms the data with the Billing_DP transformation. Then the mapping stores the output in the target path listed in the Write_XML_PATH flat file.

XML Output Schema

The XML output schema for the parser example is BillingSchema.xsd. It has the following structure:

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               attributeFormDefault="unqualified"
               elementFormDefault="unqualified">
      <xs:element name="Invoice">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Invoice_No" type="xs:float"/>
            <xs:element name="Invoice_Date" type="xs:string"/>
            <xs:element name="Order_No" type="xs:float"/>
            <xs:element name="Sub_Total" type="xs:float"/>
            <xs:element name="Tax" type="xs:float"/>
            <xs:element name="Current_Total" type="xs:float"/>
            <xs:element name="Balance_Due" type="xs:float"/>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="Order">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Quantity" type="xs:float"/>
                  <xs:element name="Product" type="xs:string"/>
                  <xs:element name="Unit_Price" type="xs:float"/>
                  <xs:element name="Total" type="xs:float"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="Terms" type="xs:string"/>
          <xs:attribute name="Company_Name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:schema>

The schema root is the Invoice element. Invoice contains the Invoice_No, Invoice_Date, Order_No, Sub_Total, Tax, Current_Total, and Balance_Due elements. The Invoice element also has the Company_Name and Terms attributes.

The Invoice element includes the multiple-occurring Order element. Each Order element contains the Quantity, Product, Unit_Price, and Total elements.
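To make the target hierarchy concrete, the following Python sketch builds one XML document with the structure described above. It is an illustration only, not output produced by the Data Processor transformation. The header values are the ones that appear in the sample invoice later in this article; the two order lines (products, quantities, and prices) are hypothetical, and the element names use the capitalization shown in the node paths of the Choose Node picker.

```python
import xml.etree.ElementTree as ET


def build_invoice() -> ET.Element:
    """Build an Invoice element that follows the hierarchy described in the text."""
    # Company_Name and Terms are attributes of the root element.
    invoice = ET.Element(
        "Invoice",
        {"Company_Name": "Container Shipping Inc.", "Terms": "Net 30"},
    )

    # Header elements, in the order of the schema sequence.
    header = [
        ("Invoice_No", "536524"),
        ("Invoice_Date", "December 24, 2013"),
        ("Order_No", "1892727"),
        ("Sub_Total", "1450.00"),
        ("Tax", "101.50"),
        ("Current_Total", "1551.50"),
        ("Balance_Due", "1551.50"),
    ]
    for name, value in header:
        ET.SubElement(invoice, name).text = value

    # Hypothetical order lines: the real sample PDF may list different items.
    order_lines = [
        ("2", "Oak desk", "500.00", "1000.00"),
        ("3", "Office chair", "150.00", "450.00"),
    ]
    for quantity, product, unit_price, total in order_lines:
        order = ET.SubElement(invoice, "Order")  # multiple-occurring element
        ET.SubElement(order, "Quantity").text = quantity
        ET.SubElement(order, "Product").text = product
        ET.SubElement(order, "Unit_Price").text = unit_price
        ET.SubElement(order, "Total").text = total

    return invoice


invoice = build_invoice()
print(ET.tostring(invoice, encoding="unicode"))
```

A document of this shape is what the parser is expected to emit: one Invoice root with two attributes, seven header elements, and one Order element per invoice line.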

Creating and Running Mapping to Transform Unstructured Data to XML

To implement this scenario, you can import the mapping m_billing.xml, which contains the Data Processor transformation, schema, and data files already set up and ready to use. Alternatively, you can create a Data Processor transformation using the schema and example source file from Billing_DP.zip, then add the transformation to a mapping. After you complete the mapping, you can validate, save, and run it. The procedure has the following steps:

1. Create a Script with a Parser in a Data Processor transformation.
2. Configure the Parser.
3. Add a Data Processor transformation to the mapping.
4. Run the mapping.

Step 1. Create a Script with a Parser in the Data Processor Transformation

Create a Data Processor transformation and then create a Script with a parser for the transformation. When you create a parser, you must have a schema that describes the output XML document. You select the element in the schema that is the root element for the output XML.

Before you begin, download the Billing_DP.zip file from the following link:

https://kb.informatica.com/h2l/howto%20library/1/0474_billing_dp.zip

Add the file to the <INSTALL_DIR>/tomcat/bin/source directory. Unzip the file to access the input PDF file, output schema file, and sample mapping.

1. In the Developer tool, in the Data Processor transformation Objects view, click New.
2. Select Script and click Next.
3. Enter a name for the Script and click Next.
4. By default, Parser is selected. If it is not, select it. Enter a name for the parser.
5. The Script component is the first component to process data in the transformation, so enable Set as startup component. Click Next.
6. To add a schema that defines the output, select Add reference to a Schema Object. Click Create a new schema object to import a new Schema object, and browse for the BillingSchema.xsd file in the <INSTALL_DIR>/tomcat/bin/source directory.
7. To add a sample PDF file that you can use to test the parser, browse for and select the Billing.pdf file in the <INSTALL_DIR>/tomcat/bin/source directory. You can change the sample PDF file later.
8. Click Finish.

The Developer tool creates a view for each parser or other Script object that you create. Click the view to configure the parser.
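The parser that you configure next relies on the marker anchor, content anchor, and repeating group concepts introduced in the overview. The following Python sketch is a rough analogy of those concepts, not IntelliScript and not how the Data Processor engine is implemented. The sample text is an assumption about what the PDF might look like after conversion to text, and the function names content_after_marker and parse_order_lines, as well as the two order lines, are hypothetical.

```python
import re
from typing import Dict, List, Optional, Union

# Hypothetical text layout; the real converted PDF may differ.
SAMPLE_TEXT = """Container Shipping Inc.
INVOICE NUMBER 536524
INVOICE DATE December 24, 2013
YOUR ORDER NO 1892727
TERMS Net 30
QUANTITY PRODUCT UNIT PRICE TOTAL
2 Oak desk 500.00 1000.00
3 Office chair 150.00 450.00
SUBTOTAL 1450.00
"""


def content_after_marker(text: str, marker: str) -> Optional[str]:
    """Locate the marker text, then read the rest of that line as the content."""
    match = re.search(re.escape(marker) + r"[ \t]*(.*)", text)
    return match.group(1).strip() if match else None


def parse_order_lines(
    text: str, start_marker: str, end_marker: str
) -> List[Dict[str, Union[float, str]]]:
    """Read one record per line between two markers, like a repeating
    group whose separator is a newline."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(start_marker)), None
    )
    if start is None:
        return []
    orders = []
    for line in lines[start + 1:]:
        if line.startswith(end_marker):
            break
        # First number on the line is the quantity; the text up to the
        # first price is the product name.
        match = re.match(r"\s*(\d+)\s+(.*?)\s+\d+\.\d+", line)
        if match:
            orders.append(
                {"Quantity": float(match.group(1)), "Product": match.group(2)}
            )
    return orders


print(content_after_marker(SAMPLE_TEXT, "INVOICE NUMBER"))
print(parse_order_lines(SAMPLE_TEXT, "QUANTITY", "SUBTOTAL"))
```

In this analogy the marker argument plays the role of a Marker anchor, the captured remainder of the line plays the role of a Content anchor that ends at a newline, and the dictionary keys stand in for data holders in the output XML.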

Step 2. Configure the Parser

Configure a Data Processor transformation Parser in the IntelliScript editor. To create mapping statements, first define marker anchors and content anchors for each data value in the PDF sample file. Then define data holders that identify the XML hierarchy element that is associated with each unstructured data element.

1. To open the IntelliScript editor, click the Script object. The IntelliScript editor displays the parser.
2. To preview the example source in text, perform the following steps:
   a. Next to the example_source property, double-click the equals sign and select LocalFile.
   b. Expand the example_source property and then click the double right arrows.
   c. Next to the pre_processor property, double-click the equals sign and select PDFToTxt_4.
   d. Next to the format property, double-click the equals sign and select TextFormat.
3. To define a content anchor that shows where the parser reads the company name, perform the following steps:
   a. In the Data Viewer view, near the top of the example source, find and select the text Container Shipping Inc., which marks the text to parse.
   b. Right-click, and then select Insert Content.
4. To transform the text Container Shipping Inc. into the Company_Name attribute in the output XML data, perform the following steps:
   a. In the IntelliScript Editor view, find the Content anchor and the data_holder property that it contains.
   b. Double-click the data_holder property to display the Choose Node picker.
   c. Expand the no target namespace element and select the /Invoice/@Company_Name output node. Then click OK.
5. To define a marker anchor that identifies where to find the invoice number value, perform the following steps:
   a. In the Data Viewer view, find and select the text INVOICE NUMBER, which marks the location of the value.
   b. Right-click, and then select Insert Marker.
6. To define a content anchor that shows where the parser reads the value of the invoice number, perform the following steps:
   a. In the Data Viewer view, find and select the text 536524, which marks the text to parse.
   b. Right-click, and then select Insert Content.
7. To transform the invoice number into the Invoice_No element in the output XML data, perform the following steps:
   a. In the IntelliScript Editor view, find the Content anchor and the data_holder property that it contains.
   b. Double-click the data_holder property to display the Choose Node picker.
   c. Expand the no target namespace element and select the /Invoice/*s/Invoice_No output node. Then click OK.

8. To transform the invoice date into the Invoice_Date element in the output XML data, perform the following steps:
   a. In the Data Viewer view, find and define the INVOICE DATE text as a Marker anchor.
   b. Find and select the text December 24, 2013 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, find the Content anchor and change the closing_marker to NewlineSearch, in case the date is longer than in the example source.
   d. Double-click the data_holder property, and in the Choose Node picker, expand the nodes to select /Invoice/*s/Invoice_Date. Then click OK.
9. To transform the order number into the Order_No element in the output XML data, perform the following steps:
   a. In the Data Viewer view, find and define the YOUR ORDER NO text as a Marker anchor.
   b. Find and select the text 1892727 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, double-click the data_holder property, and in the Choose Node picker, expand the nodes to select the /Invoice/*s/Order_No element. Then click OK.
10. To transform the invoice terms data into the Terms attribute in the output XML data, perform the following steps:
   a. In the Data Viewer view, find and define the TERMS text as a Marker anchor.
   b. Find and select the text Net 30 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, find the Content anchor and change the closing_marker to NewlineSearch, in case the terms text is longer than in the example source.
   d. Double-click the data_holder property, and in the Choose Node picker, expand the nodes to select the /Invoice/@Terms attribute. Then click OK.
11. To transform the order inventory data, add a group to hold a logical set of statements and a repeating group to process each line of the order. Perform the following steps:
   a. In the IntelliScript Editor view, double-click the last heavy double-arrows under the parser element and select Group.
   b. In the Data Viewer view, find and define the QUANTITY text as a Marker anchor.
   c. In the IntelliScript Editor view, double-click the heavy double-arrows under the Group element and select RepeatingGroup.
   d. Expand the RepeatingGroup element and change the value for separator to Marker.
   e. Expand the separator element and change the value for search to NewlineSearch.
   f. To parse the quantity value for each line of the order, create a Content anchor for that value. Double-click the heavy double-arrows under the RepeatingGroup element and select Content.
   g. To assign the quantity value to the Quantity element in the XML output, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Order/*s/Quantity. The data holder type is a number, so the parser takes the first number in each line as the quantity value.
   h. To parse the product value for each line of the order, create a Content anchor for that value. Double-click the heavy double-arrows under the previous element and select Content.
   i. Expand the Content element and change the value for phase to final.
   j. To assign the product name to the Product element in the XML output, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Order/*s/Product. The data holder type is a string, so the parser takes the string in each line as the product value.
   k. Collapse the Group element.
12. To transform the invoice subtotal to the Sub_Total element in the XML output, perform the following steps:
   a. In the Data Viewer view, find and define the SUBTOTAL text as a Marker anchor.
   b. Find and select the text 1450.00 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Sub_Total.
13. To transform the tax to the Tax element in the XML output, perform the following steps:
   a. In the Data Viewer view, find and define the TAX text as a Marker anchor.
   b. Find and select the text 101.50 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Tax.
14. To transform the invoice total to the Current_Total element in the XML output, perform the following steps:
   a. In the Data Viewer view, find and define the Total text as a Marker anchor.
   b. Find and select the text 1551.50 and define the text as a Content anchor.
   c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Current_Total.
15. To transform the invoice total to the Balance_Due element in the XML output, perform the following steps:
   a. In the IntelliScript Editor view, create a Map statement. Double-click the heavy double-arrows under the previous element and select Map.
   b. Expand the Map statement and double-click the source element. Expand the nodes to select /Invoice/*s/Current_Total.
   c. Double-click the target element. Expand the nodes to select /Invoice/*s/Balance_Due.

Step 3. Create and Run the Mapping

You can add the Data Processor transformation to a mapping and run the mapping.

1. In the Object Explorer view, create a mapping or select a mapping and select Open Mapping.
2. From the Object Explorer view, drag the Data Processor transformation into the editor.
   [Figure: the mapping with the Data Processor transformation]
3. From the Object Explorer view, drag the PDF_Path physical data object into the editor. Select Read to add the object as a source. The source appears in the editor. Drag the PDF_input port in the source to the Input input port in the Data Processor transformation. When the mapping runs, it reads input from the file designated by the path in the PDF_Path file.
4. From the Object Explorer view, drag the XML_Path physical data object into the editor. Select Write to add the object as a target. The target appears in the editor. Drag the Output output port in the Data Processor transformation to the XML_output input port in the target.
   When the mapping runs, it writes output to the file designated by the path in the XML_Path file.
   [Figure: the completed mapping]
5. Right-click in the editor, and select Run Mapping. Review the target flat file to see the mapping results.

Import the Full Mapping

If you want to check a prepared example mapping, you can import the full example mapping. The mapping contains the source flat file, transformation, and target flat file for the mapping.

1. In the Object Explorer view, select the folder where you want to create the mapping.
2. Right-click the folder and select Import.
3. Select Informatica > Import Object Metadata File (Advanced).
4. Browse for the m_pdf_mapping.xml file.
5. In the Project Import dialog box, select a folder and click Add Content to Target. For convenience, you can select a folder where you store practice examples.
6. Click Next, click Next again, and then click Finish.

The Model Repository adds the m_pdf_mapping mapping, the Billing_DP Data Processor transformation, and the output schema. The m_pdf_mapping mapping opens in the Object Explorer view.

Author

Rachel Bell, Technical Writer