Parsing a PDF File with PowerCenter


Parsing a PDF File with PowerCenter 2010 Informatica

Abstract

You can parse data from a PDF file with a PowerCenter mapping. Define the PDF file as a Data Transformation source. This article describes how to configure the Data Transformation source to interface with a Data Transformation service.

Supported Versions

PowerCenter 9.0.1

Table of Contents

Overview
Mapping Overview
PDF File Structure
Create the Data Transformation Source
Export the XML File Structure
Create the Target Definition
Data Transformation
Create the Data Transformation Project
Deploy the Data Transformation Project
Define the Service Name
Configure the Workflow
Run the Workflow

Overview

PDF is a common file format for invoices and account statements. You can configure a PowerCenter mapping to extract the data from the PDF when the page layout is the same for each invoice. Configure a Data Transformation source in the PowerCenter Designer.

This article explains how to configure a Data Transformation source that represents a multiple-page PDF file. The article shows how to configure the PowerCenter source with a Data Transformation service to extract the data from the PDF file. The target is a set of relational tables.

To parse the data from a PDF file, complete the following tasks:

1. Create the Data Transformation source in the Designer.
2. Export the structure as an XML schema from the Designer.
3. Create a Data Transformation Parser project in the Data Transformation Developer Studio. Import the XML schema that you created in PowerCenter.
4. Deploy the project as a Data Transformation service. Deploy the project to the Data Transformation repository local to the PowerCenter Client. Deploy another copy of the service to the Data Transformation repository local to the PowerCenter Integration Service.
5. Define the Data Transformation service name in the Data Transformation source.
6. Create and run the PowerCenter workflow.

Mapping Overview

Create a PowerCenter mapping to parse the data from the PDF file and pass the data to relational targets.

The following figure shows the mapping in the Designer:

The mapping contains the following objects:

Data Transformation source
    The Data Transformation source is a PDF file. The Integration Service calls a Data Transformation service to parse the data from the PDF. The Data Transformation service returns XML to the Integration Service. The Data Transformation source passes row data to the pipeline.
Application Multi-Group Source Qualifier
    The Source Qualifier transformation represents the rows that the Integration Service passes to the target. When you add the Data Transformation source to the mapping, the Designer creates a source qualifier by default.
Targets
    The targets are the Invoice Header, Buyer Total, and Transaction Detail tables.

PDF File Structure

The PDF source file is a multiple-page invoice. The first page contains the customer name, the address, and the account number. The page includes a summary of the current charges and the total balance due. The first page also contains advertising and other text that you do not need to extract.

The second page contains a list of the charges sorted by buyer. Each buyer has multiple charges.

You can view a sample PDF file in the Data Transformation tutorial #3. The PDF file is OrshavaInvoice.pdf.

The following figure shows the first page of the PDF:

The second PDF page contains transactions by buyer name:

Create the Data Transformation Source

Create the Data Transformation source in the PowerCenter Source Analyzer. When you create a Data Transformation source, the Designer creates the following default ports:

InputFileName
    Returns the name of the current input file.
OutputBuffer
    Output port that returns XML from a Data Transformation service if you do not create output ports. When you define ports on the Output Hierarchy tab, the OutputBuffer port does not return data.

To pass row data to the relational tables, configure output ports on the Output Hierarchy tab. Create a hierarchy of groups in the left pane of the Output Hierarchy tab. All groups are under the root group. Each group can contain ports and other groups.

The group structure represents the relationship between target tables. When you define a group within a group, you define a parent-child relationship between the groups. The Designer defines a primary key-foreign key relationship between the groups with a generated key.

The following figure shows the Output Hierarchy tab:

Define the following groups of ports to represent the invoice database tables:

Group1: Invoice Header
    Account. Customer account number.
    Period Ending. Date of current charges.
    Current Total. Total amount of purchases for the period.
    Balance Due. Total amount due including past due charges.

Group2: Buyer Total
    Name. Name of the buyer that purchased the products.
    Total. Total cost of the products for the buyer.

Group3: Transaction Detail
    Date. Purchase date.
    Ref. Purchase reference number.
    Product. Product name.
    Total. Product price.
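The nesting of groups can be pictured as a small tree in which every child group needs a foreign key to its parent. The following Python sketch is illustrative only: the `Group` class and the `key_relationships` function are hypothetical names, not part of PowerCenter, and the port names simply mirror the groups listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Group:
    """Hypothetical model of one group on the Output Hierarchy tab."""
    name: str
    ports: List[str]
    children: List["Group"] = field(default_factory=list)


def key_relationships(group: Group,
                      parent: Optional[Group] = None) -> List[Tuple[str, str]]:
    """Walk the tree and return one (parent, child) pair for each
    parent-child relationship, that is, each place a foreign key is needed."""
    pairs = []
    if parent is not None:
        pairs.append((parent.name, group.name))
    for child in group.children:
        pairs.extend(key_relationships(child, group))
    return pairs


# Groups and ports as described in this article.
group3 = Group("Group3", ["Date", "Ref", "Product", "Total"])
group2 = Group("Group2", ["Name", "Total"], [group3])
group1 = Group("Group1",
               ["Account", "Period_Ending", "Current_Total", "Balance_Due"],
               [group2])

relationships = key_relationships(group1)
for parent_name, child_name in relationships:
    print(f"{child_name} carries a foreign key to {parent_name}")
```

In this model, Group2 points to Group1 and Group3 points to Group2, which corresponds to the Buyer Total rows referencing an invoice and the Transaction Detail rows referencing a buyer.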

Export the XML File Structure

Export the group structure from the Output Hierarchy tab as an XML schema. You can import the .xsd file when you create the Data Transformation project in the Data Transformation Studio.

Click Export to XML Schema on the Output Hierarchy tab. The Designer creates the following .xsd file:

    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!-- ===== AUTO-GENERATED FILE - DO NOT EDIT ===== -->
    <!-- ===== This file has been generated by Informatica PowerCenter ===== -->
    <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
        targetNamespace="www.informatica.com/cdet/xsd/mappingname_dt_pdf_source"
        xmlns="www.informatica.com/cdet/xsd/mappingname_dt_pdf_source"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="pc_xsd_root" type="pc_xsd_roott"/>
      <xs:complexType name="pc_xsd_roott">
        <xs:sequence>
          <xs:element maxOccurs="unbounded" minOccurs="0" ref="group1"/>
        </xs:sequence>
      </xs:complexType>
      <xs:element name="group1" type="group1t"/>
      <xs:complexType name="group1t">
        <xs:sequence>
          <xs:element maxOccurs="unbounded" minOccurs="0" ref="group2"/>
          <xs:element minOccurs="0" name="account" type="xs:string"/>
          <xs:element minOccurs="0" name="period_ending" type="xs:string"/>
          <xs:element minOccurs="0" name="current_total" type="xs:decimal"/>
          <xs:element minOccurs="0" name="balance_due" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
      <xs:element name="group2" type="group2t"/>
      <xs:complexType name="group2t">
        <xs:sequence>
          <xs:element maxOccurs="unbounded" minOccurs="0" ref="group3"/>
          <xs:element minOccurs="0" name="name" type="xs:string"/>
          <xs:element minOccurs="0" name="total" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
      <xs:element name="group3" type="group3t"/>
      <xs:complexType name="group3t">
        <xs:sequence>
          <xs:element minOccurs="0" name="date" type="xs:string"/>
          <xs:element minOccurs="0" name="ref" type="xs:string"/>
          <xs:element minOccurs="0" name="product" type="xs:string"/>
          <xs:element minOccurs="0" name="total" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

Create the Target Definition

The target is a billing database that stores the invoice information. The database has three tables that store invoice data. The Invoice_Header table stores the invoice summary information. The Buyer_Total table stores the total sales by buyer for each invoice number. The Transaction_Detail table stores transaction information, including the date, product, and price.

The following figure shows the tables in the target definition:
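The three target tables and their key relationships can be sketched as a schema. The following Python example uses an in-memory SQLite database purely for illustration. The table and column names come from this article, but the column types, the choice of SQLite, and the `create_billing_db` helper are assumptions; an actual billing database would define its own types and constraints.

```python
import sqlite3

# Assumed column types; only the table and column names come from the article.
SCHEMA = """
CREATE TABLE Invoice_Header (
    XPK_Invoice   INTEGER PRIMARY KEY,
    Account       TEXT,
    Period_Ending TEXT,
    Current_Total NUMERIC,
    Balance_Due   NUMERIC
);
CREATE TABLE Buyer_Total (
    XPK_Buyer  INTEGER PRIMARY KEY,
    FK_Invoice INTEGER REFERENCES Invoice_Header (XPK_Invoice),
    Buyer_Name TEXT,
    Total      NUMERIC
);
CREATE TABLE Transaction_Detail (
    XPK_Transaction INTEGER PRIMARY KEY,
    FK_Buyer        INTEGER REFERENCES Buyer_Total (XPK_Buyer),
    "Date"          TEXT,
    Ref             TEXT,
    Product         TEXT,
    Total           NUMERIC
);
"""


def create_billing_db(path=":memory:"):
    """Create the three invoice tables (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK columns
    conn.executescript(SCHEMA)
    return conn


conn = create_billing_db()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Each child table carries a foreign key column that points to the generated primary key of its parent, which mirrors the parent-child group structure defined on the Output Hierarchy tab.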

Data Transformation

Data Transformation is the application that transforms file formats such as Excel spreadsheets or PDF documents. Create Data Transformation projects in the Data Transformation Studio. Deploy the projects from the Data Transformation Studio to the Data Transformation repository.

The Designer accesses the services in the Data Transformation repository when you create a Data Transformation source. The PowerCenter Integration Service accesses a Data Transformation service when it runs a workflow that has a Data Transformation source, target, or Unstructured Data transformation.

Create the Data Transformation Project

Create a Parser project in the Data Transformation Studio. A Parser project extracts data and returns XML. For this example, you can import the parser project from the Data Transformation tutorial #3. The project files are in the following directory:

    <Data Transformation Installation>\tutorials\exercises\Files_for_Tutorial_3

You can import the parser to the Data Transformation Studio from the Results directory. The parser project is PDFInvoiceParser.cmw. The tutorial describes how to create the parser.

To interface the project with PowerCenter, use the .xsd file that you exported from the Designer instead of the OrshavaInvoice.xsd file.

The parser runs a document processor to convert the data from a binary PDF format to text. The parser project uses positional formatting to determine the location of the data in the PDF. You configure the anchors that define the text location and the content. Define a repeating group for the buyer and a nested repeating group for each buyer transaction. Define a CalculateValue action to add product prices for each buyer and a total for the invoice.

You can run the project in the Data Transformation Studio and view results from the sample data. When you call a Data Transformation service from PowerCenter, the Data Transformation Engine passes the XML back to the PowerCenter Integration Service.

When you run the project, Data Transformation returns the following XML:

    <?xml version="1.0" encoding="windows-1252"?>
    <Invoice account="12345">
      <Period_Ending>April 30, 2003</Period_Ending>
      <Current_Total>351.04</Current_Total>
      <Balance_Due>475.07</Balance_Due>
      <Buyer name="Molly" total="217.65">
        <Transaction date="Apr 02" ref="22498">
          <Product>large eggs</Product>
          <Total>29.07</Total>
        </Transaction>
        <Transaction date="Apr 08" ref="22536">
          <Product>large eggs</Product>
          <Total>58.14</Total>
        </Transaction>
        <Transaction date="Apr 08" ref="22536">
          <Product>cheddar cheese</Product>
          <Total>43.61</Total>
        </Transaction>
        <Transaction date="Apr 21" ref="22798">
          <Product>cream cheese</Product>
          <Total>26.98</Total>
        </Transaction>
        <Transaction date="Apr 29" ref="22903">
          <Product>large eggs</Product>
          <Total>59.85</Total>
        </Transaction>
      </Buyer>
      <Buyer name="Jack" total="133.39">
        <Transaction date="Apr 12" ref="22570">
          <Product>large eggs</Product>
          <Total>29.93</Total>
        </Transaction>
        <Transaction date="Apr 18" ref="22734">
          <Product>large eggs</Product>
          <Total>59.85</Total>
        </Transaction>
        <Transaction date="Apr 25" ref="22841">
          <Product>cheddar cheese</Product>
          <Total>43.61</Total>
        </Transaction>
      </Buyer>
    </Invoice>

Deploy the Data Transformation Project

After you design and test the Data Transformation project, deploy the project as a service to a Data Transformation repository.

Deploy the Data Transformation project to a Data Transformation repository that is on the same machine as the PowerCenter Client. The PowerCenter Client can access the repository to retrieve Data Transformation service names and port requirements.

When you want to run the workflow, deploy the Data Transformation project to the repository that is on the same machine as the PowerCenter Integration Service. The PowerCenter Integration Service calls the Data Transformation service to transform the Data Transformation source.
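The totals that the CalculateValue action writes to the parser output can be verified independently of Data Transformation, for example when you test the project before you deploy it. The following Python sketch assumes the output has the Invoice, Buyer, and Transaction layout of the sample XML in this article. The `check_totals` function is a hypothetical helper, and the embedded sample is a compact copy of the article's data with names and dates capitalized as in the target rows.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def _tx(total):
    """Build a minimal Transaction element holding only a Total (test data)."""
    return f"<Transaction><Total>{total}</Total></Transaction>"


# Compact copy of the sample parser output; only the fields needed here.
SAMPLE = (
    '<Invoice account="12345"><Current_Total>351.04</Current_Total>'
    '<Buyer name="Molly" total="217.65">'
    + "".join(_tx(t) for t in ("29.07", "58.14", "43.61", "26.98", "59.85"))
    + '</Buyer><Buyer name="Jack" total="133.39">'
    + "".join(_tx(t) for t in ("29.93", "59.85", "43.61"))
    + "</Buyer></Invoice>"
)


def check_totals(xml_text):
    """Recompute each buyer total and the invoice total from the
    transaction amounts and compare them with the values in the XML."""
    invoice = ET.fromstring(xml_text)
    buyer_sums = {}
    for buyer in invoice.findall("Buyer"):
        amounts = (Decimal(t.findtext("Total"))
                   for t in buyer.findall("Transaction"))
        summed = sum(amounts, Decimal("0"))
        if summed != Decimal(buyer.get("total")):
            raise ValueError(f"Buyer total mismatch for {buyer.get('name')}")
        buyer_sums[buyer.get("name")] = summed
    invoice_sum = sum(buyer_sums.values(), Decimal("0"))
    if invoice_sum != Decimal(invoice.findtext("Current_Total")):
        raise ValueError("Invoice total mismatch")
    return buyer_sums, invoice_sum


buyer_sums, invoice_sum = check_totals(SAMPLE)
print(buyer_sums, invoice_sum)
```

The sketch uses `Decimal` rather than floating-point numbers so that currency amounts compare exactly.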

Define the Service Name

After you deploy the Data Transformation service, update the Data Transformation source with the service name. Add the Data Transformation service name in the Settings tab. The service name must be defined in the Data Transformation source or the mapping is invalid.

The following figure shows where to enter the Data Transformation service name in the Data Transformation source:

Configure the Workflow

Before you can run the workflow, deploy the Data Transformation project to the Data Transformation repository that is on the same machine as the PowerCenter Integration Service.

Configure the name of the source PDF in the session properties. If you want to process multiple PDF files, you can use a wildcard in the session properties. You can use the following wildcard characters in the session properties:

* (asterisk)
    Match any combination of characters. For example, *.doc matches all files with the doc extension, and ab*.txt matches every file that begins with ab and has the txt extension.
? (question mark)
    Match one character. For example, ab?.txt matches any file with ab as the first two characters and any character as the third character. The extension must be txt.

The following figure shows how to configure the session to process multiple source PDF files:
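The wildcard rules resemble shell-style file patterns. As a rough illustration only, the `fnmatchcase` function in the Python standard library applies the same `*` and `?` semantics. This is an analogy and not how the PowerCenter Integration Service is implemented: `fnmatch` also supports bracket expressions that this article does not mention, the file names below are made up, and `select_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_files(file_names, pattern):
    """Return the names that match a pattern in which * matches any
    run of characters and ? matches exactly one character."""
    return [name for name in file_names if fnmatchcase(name, pattern)]


# Made-up directory listing for the illustration.
files = [
    "OrshavaInvoice.pdf",
    "Invoice_May.pdf",
    "statement.pdf",
    "abc.txt",
    "abcd.txt",
    "notes.doc",
]

print(select_files(files, "*Invoice*.pdf"))
print(select_files(files, "ab?.txt"))
print(select_files(files, "ab*.txt"))
print(select_files(files, "*.doc"))
```

Because `*` can also match an empty run of characters, a pattern such as `*Invoice*.pdf` selects a file whose name starts with Invoice as well as one that contains Invoice in the middle.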

The source file name is *Invoice*.pdf. The session is configured to use wildcards.

Run the Workflow

When you run the workflow, the PowerCenter Integration Service passes the source PDF to the Data Transformation Engine. The Data Transformation Engine parses the PDF and returns XML to the PowerCenter Integration Service. The PowerCenter Integration Service writes rows to the target database tables.

The Integration Service writes the following row to the Invoice_Header table:

    XPK_Invoice  Account  Period_Ending   Current_Total  Balance_Due
    1            12345    April 30, 2003  351.04         475.07

The Integration Service writes the following rows to the Buyer_Total table:

    XPK_Buyer  FK_Invoice  Buyer_Name  Total
    1          1           Molly       217.65
    2          1           Jack        133.39

The Integration Service writes the following rows to the Transaction_Detail table:

    XPK_Transaction  FK_Buyer  Date    Ref    Product         Total
    1                1         Apr 02  22498  large eggs      29.07
    2                1         Apr 08  22536  large eggs      58.14
    3                1         Apr 08  22536  cheddar cheese  43.61
    4                1         Apr 21  22798  cream cheese    26.98
    5                1         Apr 29  22903  large eggs      59.85
    6                2         Apr 12  22570  large eggs      29.93
    7                2         Apr 18  22734  large eggs      59.85
    8                2         Apr 25  22841  cheddar cheese  43.61

Author

Ellen Chandler
Principal Technical Writer