Creating a Data Processor Transformation for an Unstructured Data Source

© 1993-2016 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

Use a Data Processor transformation to map data from an unstructured data source to an XML target. The Data Processor transformation contains a Script that identifies the source file, the target file, and the mappings between elements. The Script uses a parser that maps unstructured data input to XML output. This article describes how to create a mapping with a Data Processor transformation that transforms unstructured data.

Supported Versions

    Data Transformation 9.5.1
    PowerCenter Express 9.5.1

Table of Contents

    Transform an Unstructured Data Source Overview
    Scenario
    XML Output Schema
    Creating and Running Mapping to Transform Unstructured Data to XML
    Step 1. Create a Script with a Parser in the Data Processor Transformation
    Step 2. Configure the Parser
    Step 3. Create and Run the Mapping
    Import the Full Mapping

Transform an Unstructured Data Source Overview

A mapping uses a Data Processor transformation to transform documents from one format to another. A parser is a Data Processor transformation object that transforms an unstructured data input source to XML with a hierarchical structure. A Data Processor transformation uses an output schema to define the expected hierarchy of the output XML.

The parser uses anchors and data holders. Anchors identify data in the input text file. Data holders identify data in the output XML file. You use marker anchors and content anchors to identify data in parser input files. With a marker anchor, you define a character or field that marks the location of a data value. With a content anchor, you define the field that contains the value.

Scenario

The Accounts Payable department of the Hudson Furniture company receives bills in PDF format. To process bills in their payment system, they need the bill details in XML format. They need to create a mapping that transforms PDF invoices into billing data. The company billing system stores the billing data in XML format.

The mapping needs to use a Data Processor transformation that inputs bill details such as billing date, items ordered, item cost, and bill total, and outputs the details in a usable XML format. The following figure shows the mapping in this example:

The mapping contains the following objects:

Read_PDF_PATH
    The source that contains the path to the file with billing data. Reads billing data from a PDF file.
Billing_DP
    A Data Processor transformation that transforms unstructured data into an XML output hierarchy.
Write_XML_PATH
    The target path to the file that stores the transformed data every time you run the mapping.

The mapping uses the Read_PDF_PATH flat file to input the path to the PDF billing files. The mapping processes and transforms the data with the m_billing mapping. Then the mapping stores the output in the target path listed in the Write_XML_PATH flat file.

XML Output Schema

The XML output schema for the parser example is BillingSchema.xsd. It has the following structure:

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="unqualified" elementFormDefault="unqualified">
      <xs:element name="Invoice">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Invoice_No" type="xs:float"/>
            <xs:element name="Invoice_Date" type="xs:string"/>
            <xs:element name="Order_No" type="xs:float"/>
            <xs:element name="Sub_Total" type="xs:float"/>
            <xs:element name="Tax" type="xs:float"/>
            <xs:element name="Current_Total" type="xs:float"/>
            <xs:element name="Balance_Due" type="xs:float"/>
            <xs:element maxOccurs="unbounded" minOccurs="0" name="Order">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Quantity" type="xs:float"/>
                  <xs:element name="Product" type="xs:string"/>
                  <xs:element name="Unit_Price" type="xs:float"/>
                  <xs:element name="Total" type="xs:float"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="Terms" type="xs:string"/>
          <xs:attribute name="Company_Name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:schema>

The schema root is Invoice. Within Invoice are the Invoice_No, Invoice_Date, Order_No, Sub_Total, Tax, Current_Total, and Balance_Due elements. The Invoice element also contains the Company_Name and Terms attributes.

The Invoice element includes the multiple-occurring Order element, which contains further elements. Within each Order element are the Quantity, Product, Unit_Price, and Total elements.
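To make the hierarchy described above concrete, the following Python sketch builds an Invoice document with the standard xml.etree.ElementTree module, using the element and attribute names from the schema description. It is only an illustration of the target structure, not output produced by the Data Processor transformation. The header values are the sample values that appear later in this article; the single order line and the function name build_invoice are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child elements of Invoice and of Order, in the order the schema sequence lists them.
INVOICE_FIELDS = ("Invoice_No", "Invoice_Date", "Order_No", "Sub_Total",
                  "Tax", "Current_Total", "Balance_Due")
ORDER_FIELDS = ("Quantity", "Product", "Unit_Price", "Total")

# Sample header values taken from the article's example invoice.
HEADER = {
    "Invoice_No": "536524",
    "Invoice_Date": "December 24, 2013",
    "Order_No": "1892727",
    "Sub_Total": "1450.00",
    "Tax": "101.50",
    "Current_Total": "1551.50",
    "Balance_Due": "1551.50",
}
ATTRIBUTES = {"Terms": "Net 30", "Company_Name": "Container Shipping Inc."}

# Hypothetical order line: the article does not list the order lines of the sample PDF.
ORDER_LINES = [
    {"Quantity": "2", "Product": "Hypothetical item",
     "Unit_Price": "725.00", "Total": "1450.00"},
]


def build_invoice(header, attributes, order_lines):
    """Return an Invoice element laid out as the schema description states."""
    invoice = ET.Element("Invoice", attrib=dict(attributes))
    for name in INVOICE_FIELDS:
        ET.SubElement(invoice, name).text = header[name]
    # Order is the multiple-occurring element, so emit one Order per line.
    for line in order_lines:
        order = ET.SubElement(invoice, "Order")
        for name in ORDER_FIELDS:
            ET.SubElement(order, name).text = line[name]
    return invoice


invoice = build_invoice(HEADER, ATTRIBUTES, ORDER_LINES)
print(ET.tostring(invoice, encoding="unicode"))
```

The sketch keeps every value as text because XML stores element content as text; whether a value satisfies a type such as xs:float is a matter for schema validation, which this example does not perform.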

Creating and Running Mapping to Transform Unstructured Data to XML

To implement this scenario, you can import the m_billing.xml mapping, which contains the Data Processor transformation, schema, and data files already set up and ready to use. Alternatively, you can create a Data Processor transformation with the schema and example source file from Billing_DP.zip, and then add the transformation to a mapping. After you complete the mapping, you can validate, save, and run it.

To create and run the mapping, perform the following tasks:

1. Create a Script with a Parser in a Data Processor transformation.
2. Configure the Parser.
3. Add a Data Processor transformation to the mapping.
4. Run the mapping.

Step 1. Create a Script with a Parser in the Data Processor Transformation

Create a Data Processor transformation and then create a Script with a parser for the transformation. When you create a parser, you must have a schema that describes the output XML document. You select the element in the schema that is the root element for the output XML.

Before you begin, download the Billing_DP.zip file from the following link:

    https://kb.informatica.com/h2l/howto%20library/1/0474_billing_dp.zip

Add the file to the <INSTALL_DIR>/tomcat/bin/source directory. Unzip the file to access the input PDF file, output schema file, and sample mapping.

1. In the Developer tool, in the Data Processor transformation Objects view, click New.
2. Select Script and click Next.
3. Enter a name for the Script and click Next.
4. By default, Parser is selected. If not, select it. Enter a name for the parser.
5. The Script component is the first component to process data in the transformation, so enable Set as startup component. Click Next.
6. To add a schema that defines the output, select Add reference to a Schema Object. Click Create a new schema object to import a new Schema object, and browse for the BillingSchema.xsd file in the <INSTALL_DIR>/tomcat/bin/source directory.
7. To add a sample PDF file that you can use to test the parser, browse for and select the Billing.pdf file in the <INSTALL_DIR>/tomcat/bin/source directory. You can change the sample PDF file.
8. Click Finish.

The Developer tool creates a view for each parser or other Script object that you create. Click the view to configure the parser.
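The parser that you created relies on the marker anchors and content anchors introduced in the overview. As a rough illustration of that idea outside the Developer tool, the following Python sketch treats a marker as a label to search for and a content anchor as the text that follows the label up to the end of the line. It is not IntelliScript and does not reflect how the Data Processor engine works internally. The labels and header values come from this article, but the text layout, the two order lines, and the function names content_after_marker and parse_order_lines are assumptions made for the example.

```python
import re

# Hypothetical text rendering of the sample invoice. The labels and header
# values appear in the article; the layout and the order lines are assumed.
SAMPLE_TEXT = """Container Shipping Inc.
INVOICE NUMBER 536524
INVOICE DATE December 24, 2013
YOUR ORDER NO 1892727
TERMS Net 30
QUANTITY PRODUCT
2 Example item A
1 Example item B
SUBTOTAL 1450.00
"""


def content_after_marker(text, marker):
    """Return the rest of the line after the first occurrence of marker.

    The marker stands in for a Marker anchor. The captured text stands in for
    a Content anchor that closes at the next newline. Returns None when the
    marker is not found.
    """
    match = re.search(re.escape(marker) + r"[ \t]*([^\n]*)", text)
    return match.group(1).strip() if match else None


def parse_order_lines(text):
    """Read one order record per line after the QUANTITY marker.

    This mimics a repeating group whose separator is a newline: each line
    yields the leading number as Quantity and the remaining text as Product.
    """
    _, _, after = text.partition("QUANTITY")
    # Skip the remainder of the header line that contains the marker.
    body = after.split("\n", 1)[1] if "\n" in after else ""
    orders = []
    for line in body.split("\n"):
        match = re.match(r"\s*(\d+(?:\.\d+)?)\s+(.+)", line)
        if not match:
            break  # the first line that does not start with a number ends the group
        orders.append({"Quantity": float(match.group(1)),
                       "Product": match.group(2).strip()})
    return orders


for label in ("INVOICE NUMBER", "INVOICE DATE", "YOUR ORDER NO", "TERMS"):
    print(label, "->", content_after_marker(SAMPLE_TEXT, label))
print(parse_order_lines(SAMPLE_TEXT))
```

In the Developer tool you define the same relationships interactively: the marker locates the value, the content anchor captures it, and a data holder assigns the captured text to a node in the output XML.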

Step 2. Configure the Parser

Configure a Data Processor transformation Parser in the IntelliScript editor. To create mapping statements, first define marker anchors and content anchors for each data value in the PDF sample file. Then define data holders that identify the XML hierarchy element that is associated with each unstructured data element.

1. To open the IntelliScript editor, click the Script object. The IntelliScript editor displays the parser.
2. To preview the example source in text, perform the following steps:
    a. Next to the example_source property, double-click the equals sign and select LocalFile.
    b. Expand the example_source property and then click the double right arrows.
    c. Next to the pre_processor property, double-click the equals sign and select PDFToTxt_4.
    d. Next to the format property, double-click the equals sign and select TextFormat.
3. To define a content anchor that shows where the parser reads the company name, perform the following steps:
    a. In the Data Viewer view, near the top of the example source, find and select the text Container Shipping Inc., which is the text to parse.
    b. Right-click, and then select Insert Content.
4. To transform the text Container Shipping Inc. into the Company_Name attribute in the output XML data, perform the following steps:
    a. In the IntelliScript Editor view, find the Content anchor and the data_holder property that it contains.
    b. Double-click the data_holder property to display the Choose Node picker.
    c. Expand the no target namespace element and select the /Invoice/@Company_Name output node. Then click OK.
5. To define a marker anchor that identifies where to find the invoice number value, perform the following steps:
    a. In the Data Viewer view, find and select the text INVOICE NUMBER, which marks the location of the value.
    b. Right-click, and then select Insert Marker.
6. To define a content anchor that shows where the parser reads the value of the invoice number, perform the following steps:
    a. In the Data Viewer view, find and select the text 536524, which is the text to parse.
    b. Right-click, and then select Insert Content.
7. To transform the invoice number into the Invoice_No element in the output XML data, perform the following steps:
    a. In the IntelliScript Editor view, find the Content anchor and the data_holder property that it contains.
    b. Double-click the data_holder property to display the Choose Node picker.
    c. Expand the no target namespace element and select the /Invoice/*s/Invoice_No output node. Then click OK.

8. To transform the invoice date into the Invoice_Date element in the output XML data, perform the following steps:
    a. In the Data Viewer view, find and define the INVOICE DATE text as a Marker anchor.
    b. Find and select the text December 24, 2013 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, find the Content anchor and change the closing_marker to NewlineSearch, in case the date is longer than in the example source.
    d. Double-click the data_holder property, and in the Choose Node picker, expand the nodes to select /Invoice/*s/Invoice_Date. Then click OK.
9. To transform the order number into the Order_No element in the output XML data, perform the following steps:
    a. In the Data Viewer view, find and define the YOUR ORDER NO text as a Marker anchor.
    b. Find and select the text 1892727 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, double-click the data_holder property, and in the Choose Node picker, expand the nodes to select the /Invoice/*s/Order_No element. Then click OK.
10. To transform the invoice terms data into the Terms attribute in the output XML data, perform the following steps:
    a. In the Data Viewer view, find and define the TERMS text as a Marker anchor.
    b. Find and select the text Net 30 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, find the Content anchor and change the closing_marker to NewlineSearch, in case the terms text is longer than in the example source.
    d. Double-click the data_holder property, and in the Choose Node picker, expand the nodes to select the /Invoice/@Terms attribute. Then click OK.
11. To transform the order inventory data, add a group to hold a logical set of statements and a repeating group to process each line of the order. Perform the following steps:
    a. In the IntelliScript Editor view, double-click the last heavy double-arrows under the parser element and select Group.
    b. In the Data Viewer view, find and define the QUANTITY text as a Marker anchor.
    c. In the IntelliScript Editor view, double-click the heavy double-arrows under the Group element and select RepeatingGroup.
    d. Expand the RepeatingGroup element and change the value for separator to Marker.
    e. Expand the separator element and change the value for search to NewlineSearch.
    f. To parse the quantity value for each line of the order, create a content anchor for that value. Double-click the heavy double-arrows under the RepeatingGroup element and select Content.
    g. To assign the quantity value to the Quantity element in the XML output, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Order/*s/Quantity. The data holder type is a number, so the parser takes the first number in each line as the quantity value.
    h. To parse the product value for each line of the order, create a content anchor for that value. Double-click the heavy double-arrows under the previous element and select Content.
    i. Expand the Content element and change the value for phase to final.
    j. To assign the product name to the Product element in the XML output, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Order/*s/Product. The data holder type is a string, so the parser takes the string in each line as the product value.

    k. Collapse the Group element.
12. To transform the invoice subtotal to the Sub_Total element in the XML output, perform the following steps:
    a. In the Data Viewer view, find and define the SUBTOTAL text as a Marker anchor.
    b. Find and select the text 1450.00 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Sub_Total.
13. To transform the tax to the Tax element in the XML output, perform the following steps:
    a. In the Data Viewer view, find and define the TAX text as a Marker anchor.
    b. Find and select the text 101.50 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Tax.
14. To transform the invoice total to the Current_Total element in the XML output, perform the following steps:
    a. In the Data Viewer view, find and define the Total text as a Marker anchor.
    b. Find and select the text 1551.50 and define the text as a Content anchor.
    c. In the IntelliScript Editor view, expand the Content anchor and double-click the data_holder element. Expand the nodes to select /Invoice/*s/Current_Total.
15. To transform the invoice total to the Balance_Due element in the XML output, perform the following steps:
    a. In the IntelliScript Editor view, create a Map statement. Double-click the heavy double-arrows under the previous element and select Map.
    b. Expand the Map statement and double-click the source element. Expand the nodes to select /Invoice/*s/Current_Total.
    c. Double-click the target element. Expand the nodes to select /Invoice/*s/Balance_Due.

Step 3. Create and Run the Mapping

You can add the Data Processor transformation to a mapping and run the mapping.

1. In the Object Explorer view, create a mapping or select a mapping and select Open Mapping.
2. From the Object Explorer view, drag the Data Processor transformation into the editor. The following figure shows the mapping:
3. From the Object Explorer view, drag the PDF_Path physical data object into the editor. Select Read to add the object as a source. The source appears in the editor. Drag the PDF_input port in the source to the Input port in the Data Processor transformation. When the mapping runs, it reads input from the file designated by the path in the PDF_Path file.
4. From the Object Explorer view, drag the XML_Path physical data object into the editor. Select Write to add the object as a target. The target appears in the editor. Drag the Output port in the Data Processor transformation to the XML_output input port in the target.

    When the mapping runs, it writes output to the file designated by the path in the XML_Path file. The following figure shows the mapping:
5. Right-click in the editor, and select Run Mapping. Review the target flat file to see the mapping results.

Import the Full Mapping

If you want to check a prepared example mapping, you can import the full example mapping. The mapping contains the source flat file, transformation, and target flat file for the mapping.

1. In the Object Explorer view, select the folder where you want to create the mapping.
2. Right-click the folder and select Import.
3. Select Informatica > Import Object Metadata File (Advanced).
4. Browse for the m_pdf_mapping.xml file.
5. In the Project Import dialog box, select a folder and click Add Content to Target. For convenience, you can select a folder where you store practice examples.
6. Click Next, click Next again, and then click Finish.

The Model Repository adds the m_pdf_mapping mapping, the Billing_DP Data Processor transformation, and the output schema. The m_pdf_mapping mapping opens in the Object Explorer view.

Author

Rachel Bell
Technical Writer