Optimizing the Data Warehouse Infrastructure with Archiving




WHITE PAPER
Optimizing the Data Warehouse Infrastructure with Archiving
By Bill Inmon

This document contains Confidential, Proprietary and Trade Secret Information ("Confidential Information") of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise reproduced in any manner without the prior written consent of Informatica. While every attempt has been made to ensure that the information in this document is accurate and complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept responsibility for any kind of loss resulting from the use of information contained in this document. The information contained in this document is subject to change without notice. The incorporation of the product attributes discussed in these materials into any release or upgrade of any Informatica software product, as well as the timing of any such release or upgrade, is at the sole discretion of Informatica. Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775; 6,044,374; 6,208,990; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280; 10/966,046; 10/727,700. This edition published April 2010.

Table of Contents

Executive Summary
The Evolution of the Data Warehouse
The Data Lifecycle within the Data Warehouse
Dormant Data in the Data Warehouse
Data Warehouse 2.0
Partitioning Data in the Data Warehouse
Using Storage Tiers to Manage Warehouse Data
Data Archiving to Optimize Storage Tiers
Indexing Archival Data
The Changing Structure of Data over Time
Informatica Data Archive: The Complete Data Warehouse Archiving Solution
Robust Archiving Techniques Enable Optimal Storage Tiers
Multiple, Easy Access Methods to Archived Data
Automatic Indexing of Archival Data
Automatic Management of Changing Data Structures
Universal Connectivity
Integration with Other Archiving Platform, ECM, and Storage Solutions
Conclusion
About Bill Inmon

Executive Summary

The world of data and information has been in a constant state of evolution since the first usage of computers in the late 1950s. Over time, it became apparent that data, like so many entities, has a lifecycle, and that each point in the lifecycle has its own set of characteristics, storage, and access requirements.

The concept of a data warehouse evolved from the business need for reliable, consolidated, and integrated data reporting and analysis from varying points in its lifecycle, across disparate data sources. While in a gross sense a data warehouse is simply a repository of an organization's electronically stored data, it is important to recognize that any warehouse is only as good as the processes to find, access, and move items into and out of the warehouse. For data, the essential components of a data warehousing system include the ability to selectively store data, to retrieve and analyze data no matter where it is located, and to manage the data dictionary.

Operating an efficient data warehouse requires the organization to understand the differences inherent in the information stored in the data warehouse according to its point within the data lifecycle. As data ages:

- The probability of that data being accessed drops. Simply put, the older data becomes, the less frequently it is used.
- The structure of data changes. As software grows increasingly complex to process and handle more data with greater efficiency, database architectures change by necessity. This is often seen in a steady stream of software releases taking advantage of increasingly powerful hardware and software technologies.
- The amount of data being stored grows exponentially. Governed by both industry and government regulations, data must be stored and kept accessible for years. While only the first year's worth of data is actively used, maintaining historical data can easily balloon data storage to as much as 20 times larger than the current production database.

This white paper will address the issues created by a complex data lifecycle within the data warehouse and how data archiving can better manage growing data volumes. By understanding the dynamic forces at work governing the explosion of data volumes in the data warehouse, and the technologies available today to effectively archive and retrieve data based on its point in the lifecycle, the operation and cost of the data warehouse can be made more manageable, productive, and efficient.

Implementing robust archiving techniques will provide an optimal and cost-effective archiving infrastructure for the data warehouse that:

- Maintains data integrity across multiple formats
- Enables easy, on-demand access to archival data
- Provides universal connectivity and integrates with multiple archiving platforms to ensure superior and cost-effective scalability and performance
- Efficiently stores archived data to save storage capacity, while facilitating fast data retrieval

The Evolution of the Data Warehouse

The most important achievement of the data warehouse was the ability to create a platform for integrating corporate data from multiple enterprise applications to facilitate analysis and reporting. This profound transformation allowed the organization to have, for the first time, a single, integrated corporate database. It is this complete set of integrated data that allows an organization to view information enterprise-wide, from a true organizational perspective.

Figure 1. The classic data warehouse is built by passing legacy and operational application data through ETL. (Diagram labels: legacy/operational applications; ETL; data warehouse holding corporate data that is historical, integrated, detailed, and non-volatile.)

By integrating more and more data from a growing variety of data sources, organizations grew more sophisticated in handling their data, exposing the need for an expanded set of information processing capabilities. From the basic data warehouse, simply a collection of aggregated historical data, the need for a second generation data warehouse architecture and design began to evolve.

The Data Lifecycle within the Data Warehouse

As organizations became experienced with a first generation data warehouse, database administrators noticed that most of the queries were going against the most recent six months' worth of data. This first manifestation of a data lifecycle within the data warehouse came with a growing awareness that as data aged, the probability that such data would be accessed dropped. The older data became, the less frequently it was accessed.

Even more important was the awareness that as the data warehouse aged, the volume of data increased. Data in a data warehouse grows at an explosive rate. In the first one or two years of a data warehouse, the volume of data often grows at a rate of 200% to 500% per year. This rate continues to accelerate until the fourth or fifth year, when the data warehouse rate of growth drops to about 100% to 200% per year. But by that time a significant amount of data has already been collected in the data warehouse. For a variety of reasons, the data warehouse caused an explosion of data that was to be managed by the corporation.
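To get a feel for what such growth rates imply, the compounding can be worked through numerically. The sketch below is illustrative only: the 100 GB starting volume and the particular yearly rates are hypothetical, and it assumes that "grows at N% per year" means the warehouse adds N% of its current volume each year (so 200% growth triples the volume).

```python
def project_volume(start_gb, yearly_growth_pct):
    """Return the volume at the end of each year.

    A growth rate of 200% is read here as "adds 200% of the current
    volume", i.e. the volume triples (an assumption about the wording).
    """
    volumes = []
    current = start_gb
    for pct in yearly_growth_pct:
        current = current * (1 + pct / 100)
        volumes.append(current)
    return volumes


# Hypothetical scenario using the low end of each quoted range:
# 200% per year for four years, then 100% in year five.
rates = [200, 200, 200, 200, 100]
volumes = project_volume(100, rates)
for year, gb in enumerate(volumes, start=1):
    print(f"year {year}: {gb:,.0f} GB")
```

Even at the low end of the quoted ranges, the hypothetical 100 GB warehouse holds more than 16 TB by the end of year five.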

This explosion of data in the warehouse has two major impacts:

- Performance degrades across the enterprise as data grows, creating bottlenecks that negatively impact each user's ability to access data in a timely manner
- The cost of adding disk storage grows, along with the cost of maintaining an IT infrastructure to support it

As long as IT organizations can maintain systems with just the relevant amount of current data that is regularly accessed for daily operations, performance is optimal. But as the system accumulates a large amount of historical data with only a small portion of that data being used, performance worsens. Performance degrades because the system must process and handle large amounts of data that are not used.

An analogy would be cholesterol in the body. In the circulatory system of the young marathon runner there is very little cholesterol, and the young athlete has a very efficient heart. But in a 65-year-old couch potato, there is an accumulation of cholesterol causing stress to the heart, which has to expend more effort to maintain proper circulation. The same is true of a large data warehouse where the system contains a large volume of unused data. The system has to manage huge amounts of unnecessary data, and in doing so uses machine cycles that would otherwise not be necessary.

By maintaining this explosion of data in a data warehouse, IT infrastructure and maintenance costs grow exponentially even though the percentage of the data that is actually used decreases. What complicates matters is that after a certain volume of data, costs rise dramatically as supporting this data begins to require more than just physical disk. The infrastructure begins to require additional processors, complex disk arrays, additional software, and of course, staff time to operate and maintain the growing systems, causing the associated IT cost to increase exponentially.

Dormant Data in the Data Warehouse

An analysis of usage patterns shows that most queries use only the most current data, with a larger and larger portion of the data warehouse not being used. Within just two years of collecting data, most organizations find that only the most recent six months of data are being analyzed, leaving approximately 18 months of data untouched, a trend that continues unabated as data is collected over longer periods. The result is that the vast majority of the data in the data warehouse is simply never touched by anyone.

Figure 2. As the volume of data grows in the data warehouse, the amount and percentage of data that is actually used drops.

Warehouse size    Data used
50 GB             49 GB
500 GB            400 GB
2 TB              600 GB
10 TB             700 GB
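The sizes shown in Figure 2 can be converted into percentages to make the trend explicit. This is a minimal sketch that reads the figure's values as 50 GB, 500 GB, 2 TB, and 10 TB, and assumes 1 TB = 1000 GB, since the figure does not say which convention it uses.

```python
# Sizes from Figure 2, in GB. 1 TB is taken as 1000 GB (an assumption;
# the figure does not say whether decimal or binary units are meant).
figure_2 = [
    (50, 49),
    (500, 400),
    (2_000, 600),
    (10_000, 700),
]


def used_percent(total_gb, used_gb):
    """Share of the warehouse that is actually queried, in percent."""
    return 100 * used_gb / total_gb


usage = [round(used_percent(total, used), 1) for total, used in figure_2]
for (total, used), pct in zip(figure_2, usage):
    print(f"{total:>6} GB total, {used:>3} GB used: "
          f"{pct}% used, {total - used} GB dormant")
```

Under these assumptions the used share falls from 98% to 80%, then 30%, then 7%, while the dormant volume grows from 1 GB to 9,300 GB.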

The organization has just discovered what is termed dormant data. Dormant data in a data warehouse is like a 2000 pound anchor on a two-man rowboat. It causes problems far out of proportion to what one would ever imagine.

One way to understand the impact of dormant data in a data warehouse is based on the probability of access. In a mature first generation data warehouse, there is typically some current data that is used very frequently and a lot of data that is rarely or never used at all.

Figure 3. Data can be grouped by probability of access. (Diagram labels: high probability of access; low probability of access.)

The next stage in the evolution of data warehouse architecture now becomes apparent: it makes both economic and technological sense to move dormant data out of the production system to some other storage media, in a different data tier. There are three main reasons for moving dormant data out of the first generation data warehouse environment:

- The cost of the data warehouse infrastructure is greatly reduced by moving data from the first generation data warehouse onto other, less expensive storage media.
- By moving dormant data out of the first generation data warehouse into the different storage tiers available in the next generation data warehouse, the organization can now handle much larger data volumes than could ever be handled by a first generation data warehouse.
- Performance improves by alleviating the stress created by maintaining a huge database infrastructure.

Data Warehouse 2.0

Based on limitations in the first generation data warehouse, a second generation, DW 2.0, evolved to recognize and support the lifecycle within the data warehouse. There are several substantial differences between the first generation data warehouses and DW 2.0, most notably the recognition that as data ages, its characteristics and access requirements change. As a consequence, the infrastructure in DW 2.0 is divided into different storage types based on the age of the data. Data is first placed on a high performance storage type and is moved over time from this high end storage type to the next lower cost and lower performance storage type based on the probability of data access. This second generation data warehouse recognizes the need for database partitions, indexes, and storage tiers.

Partitioning Data in the Data Warehouse

One standard practice for managing the data warehouse environment is to break the archival data up into partitions. While there are many ways to partition warehouse data, the most common is to divide the data by date. One partition contains the data from 2003, the next partition contains data from 2004, the next partition contains data from 2005, and so forth. This mode of partitioning is natural because the data arrives by date. Other strategies can also be employed, such as partitioning by organizational unit or by geography. Data can also be partitioned by more than one set of parameters. For example, data can be partitioned by date and geography, or by date and organizational function.

By dividing data into partitions, data can be structured according to user access patterns. Searches that can eliminate data in multiple partitions at once can be conducted more quickly and efficiently, lowering the cost of accessing data and reducing processing demands.

Figure 4. Different ways of partitioning data. (Diagram labels: 2003, 2004, 2005; Consulting, Leasing, Hardware; Europe, Asia, Africa.)

Using Storage Tiers to Manage Warehouse Data

To further optimize the data warehouse infrastructure, data partitions can be located on different storage tiers, having different performance, access, availability, and cost characteristics, based on the access requirements of the data. There are many reasons for separating a first generation data warehouse into physically separate subdivisions that reside on and are managed on different storage tiers.

Figure 5. Data with low probability of access can be moved to alternate storage.
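The partitioning idea described above, where a search can skip whole partitions, can be sketched in a few lines. The example below is a toy model under stated assumptions: rows are plain dictionaries, the composite key is (year, region), and the field names and sample rows are hypothetical. A real warehouse would rely on the partitioning features of its database engine.

```python
from collections import defaultdict


def partition_key(row):
    """Composite key (year, region): partitioning by date and geography."""
    return (row["date"][:4], row["region"])


def build_partitions(rows):
    """Group rows into partitions keyed by (year, region)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return partitions


def query(partitions, year=None, region=None):
    """Scan only partitions whose key matches; others are skipped whole."""
    scanned = 0
    hits = []
    for (p_year, p_region), rows in partitions.items():
        if year is not None and p_year != year:
            continue
        if region is not None and p_region != region:
            continue
        scanned += 1
        hits.extend(rows)
    return hits, scanned


rows = [
    {"date": "2003-04-01", "region": "Europe", "amount": 10},
    {"date": "2004-06-15", "region": "Europe", "amount": 20},
    {"date": "2004-07-02", "region": "Asia", "amount": 30},
    {"date": "2005-01-09", "region": "Africa", "amount": 40},
]
partitions = build_partitions(rows)
hits, scanned = query(partitions, year="2004", region="Asia")
print(hits, f"(scanned {scanned} of {len(partitions)} partitions)")
```

Because the filter is applied to partition keys rather than to individual rows, the work done depends on how many partitions match, not on the total volume stored.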

The most obvious and compelling of those reasons is economics. By separating the first generation data warehouse into separate physical storage tiers, the small amount of data that is used frequently resides on expensive high performance disk storage, and the bulk of the data that is not being used resides on less expensive storage media.

Different storage tiering strategies can be employed. One possible strategy is to define storage tiers based on the performance requirements around data access and data updates:

- The interactive tier is where transaction processing takes place. The probability of access of data in the interactive tier is high.
- The integrated tier is the place where corporate data is created. The classical first generation data warehouse is found in the integrated tier. There is a reasonably high probability of access for data found in the integrated tier.
- The near line storage tier is optional. Some organizations need near line storage and some do not. Typically, near line storage is a cache for the integrated tier. Data found in near line storage has a low probability of access.
- The archival tier contains data with the lowest probability of access.

However it is done, data needs to be physically removed from the core production data warehouse, which is where, in the context of the above storage tiering strategy, the integrated storage tier would likely reside. This means that the data can be relocated to other storage types such as less expensive disk or file-based storage. In every case, dormant data needs to be placed on a storage medium separate from that of the core production data warehouse. This leads to the evolution of data warehouse archiving.

Figure 6. One of the benefits of moving data to an alternate form of storage is to reduce cost. (Diagram labels: expensive; inexpensive.)
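The economics of tiering can be illustrated with a small calculation. The sketch below places warehouse data on the integrated, near line, or archival tier according to an estimated probability of access. The probability thresholds, the relative costs per GB, and the sample data volumes are all hypothetical. The paper ranks the tiers by access probability and expense but gives no numbers.

```python
# Hypothetical thresholds and relative storage costs per GB; the paper
# only ranks the tiers by probability of access and by expense.
TIERS = [
    # (name, minimum access probability, relative cost per GB)
    ("integrated", 0.50, 10.0),
    ("near line", 0.05, 3.0),
    ("archival", 0.00, 1.0),
]


def assign_tier(access_probability):
    """Pick the first (most expensive) tier whose threshold is met."""
    for name, minimum, _cost in TIERS:
        if access_probability >= minimum:
            return name
    raise ValueError("probability must not be negative")


def storage_cost(datasets, tiered=True):
    """Relative cost of storing (size_gb, access_probability) datasets."""
    cost_by_tier = {name: cost for name, _minimum, cost in TIERS}
    total = 0.0
    for size_gb, probability in datasets:
        tier = assign_tier(probability) if tiered else "integrated"
        total += size_gb * cost_by_tier[tier]
    return total


# Hypothetical 10 TB warehouse: a small hot set and a large dormant bulk.
datasets = [(700, 0.9), (2_300, 0.1), (7_000, 0.01)]
print("single tier:", storage_cost(datasets, tiered=False))
print("tiered:     ", storage_cost(datasets, tiered=True))
```

With these made-up numbers the tiered layout costs roughly a fifth of keeping everything on the most expensive tier. The actual saving depends entirely on real prices and on how much of the warehouse is dormant.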

Data Archiving to Optimize Storage Tiers

Data archiving can be employed to automatically and physically relocate data with lower business value in data warehouses to more appropriate and cost-effective storage tiers. Data can have lower business value based on a number of criteria, such as data access and performance requirements, the age of the data, which region or department the data pertains to, or partition usage. As low access data grows to consume the lion's share of the data warehouse, the most logical progression is for this data to be physically and logically separated from the core production data warehouse.

Once the organization understands the issues of data management, the related economics, the issues of dormant data, and the evolutionary pressures created by data growth, the conclusion is inevitable that first generation data warehouses evolve to DW 2.0, and in doing so, the archival data storage tier is created.

The archival storage tier in the DW 2.0 data warehouse environment has many different characteristics that set it apart from the other parts of the data warehouse. The probability of access of data in the archival tier is low. Data is normally not updated in the archival environment. Database design may or may not be the same between the two environments.

The major drivers for data warehouse archiving are usually to reduce infrastructure cost by storage tiering, reduce maintenance cost, and maintain peak data warehouse performance. Simply relocating inactive data from the production data warehouses to lower-cost servers and storage achieves many of these goals, but your business requirements are likely to be more complex, such as how you access and retrieve archived data. You need to consider your organization's budget constraints and performance and access requirements when selecting a data warehouse archiving solution.

Your IT organization will probably access archived data less frequently than active data. But you may still have to periodically retrieve the combined archived and current data directly from the original application interface. In this case, the data should be archived to a format that facilitates relatively high query performance, such as another data warehouse instance located on a lower-cost infrastructure. On the other hand, if inactive data is quite old and is ready to be retired, you may have to access it only rarely. In this case, access from a reporting or e-discovery tool, rather than from an application interface, may be adequate. Slower query performance can be tolerated, and the data may be archived to a more optimal, compressed format, such as a compressed file.

Indexing Archival Data

Another significant component of building a data archiving environment is the practice of creating passive indexes. In the active parts of the data warehouse, the practice of creating indexes to enhance performance is very common. In the archival environment, however, projecting business requirements for future data access can be difficult. Generally, the archival environment is examined whenever a business need arises. But the business need may not be recognized for 20 years after the archival data is stored. Therefore, the processes used to build indexes in the data warehouse do not apply to the archival environment. To that end, there is the design practice of creating what are called passive indexes.

Typically, passive indexes are built using the likely or possible criteria for fast future retrieval of archival data. Part numbers, customer names, order numbers, phone calls made, and an episode of care are all likely pieces of data that could be indexed. An analysis of common usage patterns can help determine what data is likely to be referenced in the future. Archiving software should be able to analyze the data and automatically create indexes during the archival process, optimizing it for future access.

Figure 7. The more archival data can be indexed, the faster subsequent searches will be.

The Changing Structure of Data over Time

Because every organization undergoes change, and every change is ultimately reflected in the data structure, the database designer should expect that the data structure found in the archival environment will not remain constant. As data is added to the archival environment, the issue of managing data stored in different releases of software technology over a long period of time arises. Suppose that an organization starts to store data in the archival environment in 1990 under release 2.0 of a product. By 1996 the data is stored under release 3.1 of the product. More time passes and by 2005 data is stored under release 8.i. In 2010 data is stored under release 11.4.

Figure 8. Over time, newer releases of the software will change the data structure. (Diagram labels: Release 2.0, Release 3.1, Release 8.i, Release 11.4.)

Such a progression is absolutely normal. The question now becomes: can the current software release read and recognize data that was stored under an earlier release? Usually software vendors can handle the previous release of the software. But when it comes to going back to a software release that is a decade old (or even two decades old), a time comes when a vendor can no longer support a past data architecture while providing the new functionality that has become essential.
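The passive-index practice described earlier in this section can be sketched as follows. This is a minimal in-memory illustration, not the indexing scheme of any particular archiving product: the record layout and the choice of indexed fields (part number, customer name, order number) are assumptions made for the example.

```python
from collections import defaultdict

# Fields judged likely to be searched later. The choice is an assumption
# made at archive time, before any concrete query is known.
PASSIVE_INDEX_FIELDS = ("part_number", "customer_name", "order_number")


def build_passive_indexes(records):
    """Map field -> value -> list of record positions in the archive."""
    indexes = {field: defaultdict(list) for field in PASSIVE_INDEX_FIELDS}
    for position, record in enumerate(records):
        for field in PASSIVE_INDEX_FIELDS:
            if field in record:
                indexes[field][record[field]].append(position)
    return indexes


def lookup(records, indexes, field, value):
    """Use the index when one exists, otherwise fall back to a full scan."""
    if field in indexes:
        return [records[i] for i in indexes[field].get(value, [])]
    return [r for r in records if r.get(field) == value]


# Hypothetical archived order lines.
archive = [
    {"order_number": 1001, "customer_name": "Acme", "part_number": "P-7"},
    {"order_number": 1002, "customer_name": "Birch", "part_number": "P-7"},
    {"order_number": 1003, "customer_name": "Acme", "part_number": "P-9"},
]
indexes = build_passive_indexes(archive)  # built once, while archiving
print(lookup(archive, indexes, "customer_name", "Acme"))
```

The indexes are built once during archiving, so a query that arrives years later on one of the anticipated fields avoids a scan of the whole archive. Queries on unanticipated fields still work, only more slowly.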

There are many approaches to handling changing data structures in the archival environment. One essential element of the archival environment is metadata: the descriptive information that defines the context and structure of the archival data. Maintaining the right metadata is essential to handling structural changes to archival data. One solution to managing structural changes is to maintain multiple metadata versions corresponding to the structural changes over time. Another solution is to update the metadata periodically to synchronize the archival metadata with the core production data warehouse structure. Regardless of the approach, a data archive solution needs to handle structural changes to archival data based on the evolution of the production data warehouse over time and shield the user from the maintenance nightmare.

Figure 9. Over time, the basic structure of data changes. (Diagram labels: 2004, 2005, 2006.)

By evolving the data warehouse infrastructure to the DW 2.0 architecture, organizations become better able to balance data to meet access and system performance requirements. In doing so, the cost of the data warehouse is mitigated, enabling the data warehouse to more efficiently accommodate huge amounts of data. In addition, the data warehouse can store and manage data over a wide range of time. DW 2.0 manages data that is two seconds old and data that is 20 years old.

Figure 10. There is a natural evolution of data warehouses from the classic first generation to DW 2.0. (Diagram: the DW 2.0 architecture for the next generation of data warehousing, with four sectors: Interactive (very current), Integrated (current++), Near line (less than current), and Archival (older). Each sector holds structured and unstructured data. Other labels include ETL and data quality, transaction data, local metadata, the enterprise metadata repository, and master data.)
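The first approach mentioned above, keeping multiple metadata versions that correspond to structural changes over time, might look like the following sketch. The column lists and the years (borrowed from the labels in Figure 9) are hypothetical. The point is only that each archived row is decoded with the metadata version in force when it was written.

```python
from bisect import bisect_right

# Hypothetical metadata versions: (effective year, column names).
METADATA_VERSIONS = [
    (2004, ["order_id", "customer", "amount"]),
    (2005, ["order_id", "customer", "amount", "currency"]),
    (2006, ["order_id", "customer_id", "amount", "currency"]),
]


def metadata_for(year):
    """Return the column list in force when a record was archived."""
    years = [effective for effective, _columns in METADATA_VERSIONS]
    index = bisect_right(years, year) - 1
    if index < 0:
        raise LookupError(f"no metadata version covers {year}")
    return METADATA_VERSIONS[index][1]


def decode(archived_year, raw_values):
    """Attach column names to a raw archived row using its own version."""
    columns = metadata_for(archived_year)
    if len(columns) != len(raw_values):
        raise ValueError("row does not match its metadata version")
    return dict(zip(columns, raw_values))


print(decode(2004, [1, "Acme", 99.5]))
print(decode(2007, [2, 4711, 12.0, "EUR"]))
```

Because the old versions are retained rather than overwritten, rows written under an earlier structure remain readable without being rewritten each time the production structure changes.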

Informatica Data Archive: The Complete Data Warehouse Archiving Solution

Informatica Data Archive helps your IT organization to cost-effectively manage the explosion of data volumes in data warehouses. It allows you to easily and safely archive inactive data, and then readily access it when needed. Informatica Data Archive delivers the full range of capabilities that your IT organization needs to effectively manage data growth in data warehouses, including:

- Robust archiving techniques that ensure data integrity after archiving and support multiple archive formats to enable optimal storage tiers
- Multiple, easy access methods to archived data
- Automatic indexing of archived data
- Automatic management of changing data structures
- Universal connectivity
- Integration with other archiving platforms, ECM, and storage solutions, such as Symantec, Commvault, and EMC

By leveraging the power of the Informatica Platform, the industry's leading data integration platform, Informatica Data Archive enables organizations to handle the huge data volumes typical of very large global enterprises. The software provides superior scalability and performance, delivering data to the most cost-effective storage option based on its value. It also offers unparalleled interoperability. The software is based on an open, easily extensible architecture, enabling simple integration with third-party solutions.

Robust Archiving Techniques Enable Optimal Storage Tiers

With Informatica Data Archive, you can archive to another data warehouse instance or to a highly compressed file format that can result in dramatic storage capacity savings. The compression ratio that can be achieved depends on the size of the data and on the redundancy in its values: the larger the data and the more redundant its values, the higher the ratio. You may be able to achieve a compression ratio of 20:1 to 60:1 compared to the original data size. The choice of archiving to another data warehouse or a compressed file archive should be based on the age of the data and response time as well as frequency of access.
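The dependence of compression ratio on redundancy is easy to observe with a general-purpose compressor. The sketch below uses Python's standard zlib module purely as a stand-in. It is not the compressed format used by Informatica Data Archive, and the ratios it prints say nothing about that product. The sample row content is hypothetical.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, level=9))


# Highly redundant rows, as in a detail table with few distinct values.
redundant = b"2004-06-15,EUROPE,LEASING,CLOSED\n" * 10_000
# Random bytes stand in for data with no redundancy at all.
random_like = os.urandom(len(redundant))

print(f"redundant rows: {compression_ratio(redundant):.0f}:1")
print(f"random bytes:   {compression_ratio(random_like):.2f}:1")
```

The repeated rows shrink dramatically, while the random bytes do not shrink at all. This is why archives of repetitive transactional detail tend to compress far better than data with few repeated values.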
If yo still need to access the data with relatively high freqency and with high performance, then archiving to another data warehose instance is more appropriate. However, if data will e rarely accessed, for infreqent reporting or adit reqirements, then archiving to a highly compressed file is the more optimal soltion. Archived data can e stored on a file system located in lower cost storage or even storage in the clod, for economies of scale. As data ages and access reqirements change over time, Informatica Data Archive atomatically converts and relocates the data from one archiving format and location to another, enaling mltiple cost-effective storage tiers. Informatica Data Archive enales yo to archive transactional and detailed data only, which are the fastest growing. This is done while maintaining data integrity and links to dimensional and aggregate tales that may still e stored in the prodction system. Eventally, some older dimension records may also e archived as well. Informatica Data Archive has deep knowledge aot what types of tales shold e archived to spport an optimal archiving strategy. Informatica Data Archive can also handle partitions that were created in the prodction data Optimizing the Data Warehose Infrastrctre with Archiving 11

warehose and maintain those data partitions in the data archive, to maintain scalaility and performance. Figre 11 illstrates a data warehose archiving strategy where detailed data are slowly relocated to another dataase and sseqently to a more optimal compressed file format, which reslts in extreme redction in storage capacity. Informatica Data Archive provides an easy to se graphical interface to define archiving jos easily withot extensive configration, scripting, or programming. Figre 11 shows the Informatica Data Archive wizard-ased interface to allow sers to easily define and monitor archiving jos. Prodction Data Warehose (less than 2 years old) Archive Data Warehose (2-7 years old) Optimized File Archive (40:1 compression) (over 7 years old) DIM1 DIM2 DIM3 AGG1 AGG2 AGG2 DETAIL 1 DETAIL 2 OLD_DIM3 DETAIL 3 DETAIL 4 OLD_DIM2 DETAIL 5 DETAIL 6 DETAIL 7 Figre 11. Informatica Data Archive offers mltiple archiving formats (dataase or compressed file) that enale optimal storage tiering and the flexiility to archive different types of records while maintaining data integrity A data warehose archiving soltion that offers mltiple archiving formats and accessiility options allows IT organizations to determine the appropriate tradeoffs among archive size, performance, application accessiility, and cost. Figre 12. Archive complete siness entities sing Informatica Data Archive. Yor IT organization mst also e ale to restore archive data to its original location. Otherwise, there is no way to correct mistakes dring archiving or to accommodate changes to access reqirements. If archived data later needs to ecome active again and for some reason modified and annotated, then it also needs to e restored. For example, a cstomer order that is closed and reopened may need to e restored ecase it has ecome active again. Informatica Data Archive allows yo to restore archived data at different levels of granlarity, sch as selected detail records, siness entities, or an entire archive. 

Multiple, Easy Access Methods to Archived Data

Regardless of the archive format, archived data needs to be easily accessible either from the original application interface or through standard interfaces for reporting or compliance audits. Informatica Data Archive supports standard SQL/ODBC/JDBC interfaces for reporting using any reporting or business intelligence tool. The solution also offers the option to access the data from an application-aware data discovery portal to easily search, browse, and view archived or retired data based on business entities, with a look and feel similar to the original application interface.

Automatic Indexing of Archival Data

When archiving data to another data warehouse instance, Informatica Data Archive automatically builds and maintains the indexes that exist in the production data warehouse instance. When archiving to a highly compressed file archive, data is automatically indexed and stored in an optimal format to facilitate efficient storage and scalable retrieval. No performance tuning or maintenance is required on the archival data, reducing IT staff time.

Automatic Management of Changing Data Structures

As the production data warehouse structure continues to evolve, Informatica Data Archive automatically updates the metadata and structure of the archival data warehouse. When archiving to a highly compressed file format, Informatica Data Archive maintains multiple versions of the metadata, corresponding to periodic snapshots of the production data warehouse structure. This enables point-in-time querying of the archival data based on the structure of the data warehouse at that point in time. By automatically managing the metadata and structure of archive data, based on the changing structure of the production data warehouse, Informatica Data Archive reduces the maintenance effort required on the archival infrastructure.
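The idea of keeping several metadata versions and resolving a query against the structure in effect at a given date can be sketched as follows. This is a minimal Python illustration and does not describe how Informatica Data Archive stores metadata. The class name `SchemaHistory`, its methods, and the column names are all hypothetical.

```python
from bisect import bisect_right
from datetime import date


class SchemaHistory:
    """Keeps dated snapshots of a table's column list (hypothetical model)."""

    def __init__(self):
        self._dates = []    # snapshot dates, kept sorted
        self._schemas = []  # column tuples, parallel to self._dates

    def add_snapshot(self, taken_on: date, columns) -> None:
        """Record the table structure as it existed on taken_on."""
        idx = bisect_right(self._dates, taken_on)
        self._dates.insert(idx, taken_on)
        self._schemas.insert(idx, tuple(columns))

    def schema_as_of(self, when: date) -> tuple:
        """Return the most recent snapshot taken on or before when."""
        idx = bisect_right(self._dates, when)
        if idx == 0:
            raise LookupError("no metadata snapshot on or before " + when.isoformat())
        return self._schemas[idx - 1]


if __name__ == "__main__":
    history = SchemaHistory()
    history.add_snapshot(date(2005, 1, 1), ["order_id", "amount"])
    history.add_snapshot(date(2008, 1, 1), ["order_id", "amount", "currency"])
    print(history.schema_as_of(date(2006, 6, 1)))
    print(history.schema_as_of(date(2009, 3, 1)))
```

A binary search over sorted snapshot dates keeps the lookup cheap even when many structural versions accumulate over the retention period.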
Universal Connectivity

If your organization is like many other enterprises, you have data warehouses and applications on multiple database systems running on varying operating systems. To support your enterprise needs, Informatica Data Archive enables you to manage archive processes across data warehouses and applications on diverse databases, including relational (e.g. Oracle, DB2, Sybase, SQL Server, Teradata, Informix), mainframe (e.g. IDMS, VSAM, IMS), files, and packaged CRM and ERP applications on open systems (e.g. Windows, Linux, UNIX) or mainframes (e.g. z/OS, AS/400).

Integration with Other Archiving Platform, ECM, and Storage Solutions

Your company may already have an archiving solution for e-mails and files. Your IT organization may also have standardized on an Enterprise Content Management (ECM) solution to manage your unstructured data. To support compliance with regulatory requirements and ensure immutability and single-instance storage of retained data, you may be using archiving platforms, such as Content Addressable Storage (CAS), which require proprietary connectivity.

To enable your organization to respond quickly and accurately to audit requests as well as to cost-effectively retain data for longer periods, Informatica Data Archive allows you to centrally manage and discover archived data of all types, both structured and unstructured. This is achieved through integration with existing archiving, content management, and storage solutions, including EMC Documentum, Symantec Enterprise Vault and Discovery Accelerator, and CommVault Simpana and eDiscovery, to facilitate centralized management and e-discovery of all types of archived data.

Conclusion

Driven by an explosion of data in corporate environments, the data warehouse has evolved from a simple platform for reliable, consolidated, and integrated data reporting to a sophisticated data infrastructure that recognizes a complex data lifecycle. By understanding the dynamic forces at work in data growth, storage, and accessibility, and the technologies available today to effectively manage the operation and cost of the data warehouse, IT organizations should now be better positioned to implement solutions that make their data warehouse environments more manageable, productive, and efficient.

DW 2.0, the second-generation data warehouse, recognizes that as data ages, its characteristics and access requirements change. By dividing data into different storage tiers based on age and frequency of access, from high-performance storage for interactive data to lower-cost, lower-performance storage for rarely accessed or inactive data, DW 2.0 provides a platform for managing warehouse data more effectively.

The key to capping your IT organization's data warehouse management costs and risks is to relocate dormant data to the lower-cost infrastructure that the storage tiering in the DW 2.0 architecture makes available. This is what data warehouse archiving solutions can do for you: archive data based on its point in the lifecycle, while maintaining data integrity and easy access to the data.

Informatica Data Archive enables organizations to handle the huge data volumes typical of very large global enterprises. By providing comprehensive and robust techniques to easily and safely archive inactive data, and then readily access it when needed, Informatica Data Archive delivers the complete archiving solution necessary to provide an optimal and cost-effective data warehouse infrastructure.
When your IT organization implements a complete, scalable, and flexible archiving solution, you'll reduce the total cost of ownership of your data warehouses and other applications by:

- Reducing storage, server, software, and maintenance costs
- Improving data warehouse performance
- Increasing data warehouse availability
- Supporting compliance with internal, industry, and governmental mandates and regulations

Together, Informatica and your IT organization can align the business value of data with the most appropriate and cost-effective IT infrastructure to manage it.

Learn More

Learn more about the Informatica Platform. Visit us at www.informatica.com or call +1 650-385-5000 (1-800-653-3871 in the U.S.).

About Informatica

Informatica Corporation (NASDAQ: INFA) is the world's number one independent provider of data integration software. Organizations around the world gain a competitive advantage in today's global information economy with timely, relevant, and trustworthy data for their top business imperatives. More than 3,900 enterprises worldwide rely on Informatica to access, integrate, and trust their information assets held in the traditional enterprise, off premise, and in the Cloud.

About Bill Inmon

Bill Inmon, the father of data warehousing, has written 52 books translated into 9 languages. Bill founded and took public the world's first ETL software company. Bill has written over 1000 articles and published in most major trade journals. Bill has conducted seminars on every continent except Antarctica.

References

W H Inmon, DW 2.0 ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING, 2008, Morgan Kaufmann, Boston, Mass.

Inmoncif.com: a website with many white papers and other information about data warehouses and DW 2.0.

Worldwide Headquarters, 100 Cardinal Way, Redwood City, CA 94063, USA
phone: 650.385.5000 fax: 650.385.5500 toll-free in the US: 1.800.653.3871 www.informatica.com

© 2010 Informatica Corporation. All rights reserved. Printed in the U.S.A. Informatica, the Informatica logo, and The Data Integration Company are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

First Published: April 2010 7117 (04/01/2010)