#MMTM15 #INFOARCHIVE #EMCWORLD
INFOARCHIVE: A TECHNICAL OVERVIEW
DAVID HUMBY, SOFTWARE ARCHITECT
AGENDA
- InfoArchive High Level Goals
- Architecture
- Core Principles
- Scalability
- Retention Policy Services Integration
[Diagram: EMC InfoArchive positioning. Inputs: legacy applications (read only, shut down application) and active applications (read & write, data growth). IT cost drivers: infrastructure & storage, storage costs, backup costs, performance. EMC InfoArchive provides compliance management and extended, controlled data access for data users: line of business, customers, compliance, auditor.]
HIGH LEVEL GOALS
Scalability
- Number of sources / data types
- Data volumes
- Number of requests (search, retrieval)
Flexibility
- Data types
- SLAs (ingestion, search, retrieval)
- Compliance requirements
TCO
- Infrastructure costs
- Operating costs
- Costs to add a new data type, source or client application
HIGH LEVEL GOALS
Flexibility: Data types
- Structured data: flat records, complex records
- Documents, described by properties (e.g. Name, Title, Keywords, Customer)
HIGH LEVEL GOALS
Flexibility: Data types
Complex structured records, each of which can have a varying number of associated unstructured contents
Flexibility: Data types
Source application: a complex RDBMS data model, potentially with references to external associated content
Archive: a simple, self-describing data model
- No direct link between the source model and the archive model
- Information is independently comprehensible
- Simplifies adapting to source data model changes, e.g. application upgrades
Example business object: Purchase Order (PO Number, PO Date, Customer Ref, Product Description)
EXAMPLE TABLE DATA: EMPLOYEES
5 tables: employees.xml, dep_emp.xml, departments.xml, salaries.xml, titles.xml

employees.xml:
    <employees>
      <employee>
        <emp_no>10001</emp_no>
        <birth_date>1953-09-02</birth_date>
        <first_name>georgi</first_name>
        <last_name>facello</last_name>
        <gender>m</gender>
        <hire_date>1986-06-26</hire_date>
      </employee>
      <employee>
        <emp_no>10002</emp_no>
        <birth_date>1964-06-02</birth_date>
        <first_name>bezalel</first_name>
        <last_name>simmel</last_name>
        <gender>f</gender>
        <hire_date>1985-11-21</hire_date>
      </employee>
      ...
    </employees>

dep_emp.xml:
    <dep_emps>
      <dep_emp>
        <emp_no>10001</emp_no>
        <dept_no>d005</dept_no>
        <from_date>1986-06-26</from_date>
        <to_date>9999-01-01</to_date>
      </dep_emp>
      <dep_emp>
        <emp_no>10002</emp_no>
        <dept_no>d007</dept_no>
        <from_date>1996-08-03</from_date>
        <to_date>9999-01-0</to_date>
      </dep_emp>
      ...
    </dep_emps>
EXAMPLE DATA: EMPLOYEE BUSINESS OBJECT
    <employee>
      <emp_no>10015</emp_no>
      <birth_date>1959-08-19</birth_date>
      <first_name>guoxiang</first_name>
      <last_name>nooteboom</last_name>
      <gender>m</gender>
      <hire_date>1987-07-02</hire_date>
      <departments>
        <dept_no>d008</dept_no>
        <dept_name>research</dept_name>
        <from_date>1992-09-19</from_date>
        <to_date>1993-08-22</to_date>
      </departments>
      <titles>
        <title>
          <title>senior Staff</title>
          <from_date>1992-09-19</from_date>
          <to_date>1993-08-22</to_date>
        </title>
      </titles>
      <salaries>
        <salary>
          <amount>40000</amount>
          <from_date>1992-09-19</from_date>
          <to_date>1993-08-22</to_date>
        </salary>
      </salaries>
    </employee>
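Going from the table exports to a self-describing business object amounts to a join on emp_no followed by nesting. The following is a minimal Python sketch of that denormalization, not InfoArchive tooling: the function name, the <department> wrapper element and the abridged sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copies of two table exports (only a few columns are kept here).
EMPLOYEES_XML = """
<employees>
  <employee><emp_no>10001</emp_no><first_name>georgi</first_name></employee>
  <employee><emp_no>10002</emp_no><first_name>bezalel</first_name></employee>
</employees>
"""

DEP_EMPS_XML = """
<dep_emps>
  <dep_emp><emp_no>10001</emp_no><dept_no>d005</dept_no><from_date>1986-06-26</from_date></dep_emp>
  <dep_emp><emp_no>10002</emp_no><dept_no>d007</dept_no><from_date>1996-08-03</from_date></dep_emp>
</dep_emps>
"""


def build_business_objects(employees_xml, dep_emps_xml):
    """Nest each employee's dep_emp rows under the employee element."""
    employees = ET.fromstring(employees_xml)

    # Group the department rows by the join key, emp_no.
    rows_by_emp = {}
    for row in ET.fromstring(dep_emps_xml).findall("dep_emp"):
        rows_by_emp.setdefault(row.findtext("emp_no"), []).append(row)

    for employee in employees.findall("employee"):
        departments = ET.SubElement(employee, "departments")
        for row in rows_by_emp.get(employee.findtext("emp_no"), []):
            # <department> is a hypothetical wrapper element for this sketch.
            department = ET.SubElement(departments, "department")
            for column in row:
                if column.tag != "emp_no":  # the key is implied by the parent
                    ET.SubElement(department, column.tag).text = column.text
    return employees


business_objects = build_business_objects(EMPLOYEES_XML, DEP_EMPS_XML)
print(ET.tostring(business_objects, encoding="unicode"))
```

Once nested this way, each employee element can be understood without consulting the other tables, which is the property the business-object model is after.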
HIGH LEVEL GOALS
Scalability & TCO
Performance and TCO must not change significantly with the number/volume of archived items.
[Chart: operating costs and performance over time as the number/volume of archived items grows (millions to billions of items, TB to hundreds of TB)]
ARCHITECTURE
[Diagram: architecture layers]
- Source & client applications
- Connectors/Extractors (data, confirmations)
- Batch & transactional ingestion, Data Access, GUI
- Data Services (xdb), Archive Services, Content Services (Content Server)
- Storage platform: EMC (Atmos, Centera, Isilon, VNX) + others
ARCHITECTURE
[Diagram: functional modules]
- Source applications: send data, receive confirmations
- Batch Ingestion Module: reception, scheduling, ingestion, commit, confirmations
- Transactional Ingestion Module: transactional ingestion, rejection/invalidation
- Archive Services Module: access control, classification, confirmation generation, auditing, retention management
- Archive Storage: archived data
- Data Access Module (client applications): synchronous search, asynchronous search, content retrieval
- Systems: administration, reporting, portal
- Also shown: partition management, package aggregation, partition caching, asynchronous search execution, audits and logs
ARCHITECTURE
[Diagram: deployment components. Legend: EMC product, InfoArchive component, third-party product, storage area]
- Source Application, File Transfer
- Receiver, reception area
- Job Scheduler, Enumerator, Ingestor, ingestion working area
- Application Server: GUI, Web Services, working area
- Documentum Administrator (DA) extensions
- Order processor
- Content Server, staging area
- RDBMS database, RDBMS datafiles
- xdb database, xdb datafiles, xdb cache
- Archiving storage (e.g. Atmos, Centera, Isilon, NAS, SAN)
- Client Application
CORE PRINCIPLES
Open Archival Information System (OAIS, ISO 14721)
InfoArchive is aligned with the OAIS framework, which defines the core features and processes of a digital archive.
CORE PRINCIPLES
Base OAIS terminology
- Submission Information Packages (SIPs): submitted at ingestion
- Archive Information Packages (AIPs): held in archive storage; an AIP contains Archived Information Units (AIUs)
- Dissemination Information Packages (DIPs): delivered through data access
- Archive Holdings: used for information classification
[Diagram: EMC InfoArchive: Data, Ingestion, Archive Services, Archive Storage, Data Access]
CORE PRINCIPLES
What do we archive?
A SIP (Submission Information Package)
CORE PRINCIPLES
What do we archive?
- SIP Descriptor: describes the data in the SIP; it is a small XML file that must conform to a simple schema imposed by InfoArchive
- SIP Data: the data to be archived
CORE PRINCIPLES
What do we archive?
The payload consists of a number of business objects called AIUs (Archive Information Units): documents, transactions, customer cases, ...
[Diagram: a SIP made up of a SIP descriptor (XML) and SIP data holding many AIUs]
CORE PRINCIPLES
What do we archive?
Each AIU always has:
- Metadata (structured data)
CORE PRINCIPLES
What do we archive?
Each AIU consists of:
- Metadata (structured data)
- Optionally, content information (unstructured data)
CORE PRINCIPLES
What do we archive?
- The metadata for all AIUs is contained in a single XML file
- Content information is a collection of files in any format
CORE PRINCIPLES
What do we archive?
InfoArchive does not impose:
- The schema of the structured data, which can be as complex as needed
- How content files are referenced in the structured data
  - A given content file can be referenced by several AIUs
  - AIUs can reference a varying number of content files
Example SIP:
- eas_sip.xml: SIP descriptor
- eas_pdi.xml: structured data
- recording1.mp3, recording2.mp3, recording3.mp4: content information (if any)
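Because the way AIUs reference content files is left to the archived application, any example has to pick a schema. The Python sketch below uses a hypothetical one (a <recording file="..."/> child per reference; none of these element or attribute names come from InfoArchive) to show one file shared by two AIUs, an AIU with no content, and a simple check that every referenced file is present in the SIP.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured-data file: the element and attribute names are
# invented for this sketch; the real schema is up to the source application.
PDI_XML = """
<calls>
  <call id="c1"><recording file="recording1.mp3"/></call>
  <call id="c2"><recording file="recording1.mp3"/><recording file="recording2.mp3"/></call>
  <call id="c3"/>
</calls>
"""

# Content files shipped in the example SIP.
SIP_CONTENT_FILES = {"recording1.mp3", "recording2.mp3", "recording3.mp4"}


def content_references(pdi_xml):
    """Map each AIU id to the list of content files it references."""
    root = ET.fromstring(pdi_xml)
    return {
        call.get("id"): [rec.get("file") for rec in call.findall("recording")]
        for call in root.findall("call")
    }


def missing_files(references, available):
    """Return the referenced file names that are not present in the SIP."""
    referenced = {name for files in references.values() for name in files}
    return referenced - available


references = content_references(PDI_XML)
print(references)
print(missing_files(references, SIP_CONTENT_FILES))
```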
CORE PRINCIPLES
How is information processed & stored?
- An AIP lightweight (LWSO) repository object is created
- Information read from the descriptor is assigned as properties of the object
- Content information is imported as content of the object
- Metadata is imported into an xdb partition (also known as a detachable library)
[Diagram: SIP (descriptor, metadata, content information) mapped to an AIP object in the repository and to xdb]
CORE PRINCIPLES
Data Partitions
- A compressed copy of the xdb partition data file can be imported as a rendition of the AIP object
  - This allows the partition to be skipped during xdb backup
- The caching service can cache an xdb partition out or in
[Diagram: AIP object, xdb, caching service, archive storage (e.g. CAS), detached xdb storage (file system)]
CORE PRINCIPLES
Query Processing
The caching service can be configured for each data type according to its associated search SLA:
- All xdb partitions can be cached in, providing synchronous search on all data
- The xdb partitions of a predefined recent period can be kept in the cache, providing synchronous search on that period
- An asynchronous search can still be posted even if it covers a cached-out partition
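The two caching policies described here (cache everything, or keep only a recent period cached in) can be expressed as one selection rule. This is a rough Python sketch under stated assumptions: the function name, the recent_days parameter and the use of each partition's latest covered date are all hypothetical, and the real caching service configuration is not shown in this deck.

```python
from datetime import date, timedelta


def partitions_to_cache_in(partition_max_dates, today, recent_days=None):
    """Return the ids of the partitions that should stay cached in.

    partition_max_dates maps a partition id to the latest date it covers.
    recent_days=None models the "cache all partitions" policy; otherwise
    only partitions reaching into the last recent_days days are kept.
    """
    if recent_days is None:
        return set(partition_max_dates)
    cutoff = today - timedelta(days=recent_days)
    return {
        partition_id
        for partition_id, max_date in partition_max_dates.items()
        if max_date >= cutoff
    }


partitions = {1: date(2011, 1, 31), 2: date(2011, 2, 28), 9: date(2011, 9, 30)}
print(partitions_to_cache_in(partitions, today=date(2011, 10, 15), recent_days=60))
print(partitions_to_cache_in(partitions, today=date(2011, 10, 15)))
```

Partitions left out of the returned set remain searchable, but only through an asynchronous search.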
CORE PRINCIPLES
How do we search information?
InfoArchive uses a two-phase search:
1. Find the packages (AIPs) that might contain records (AIUs) matching the search criteria.
2. Search within these AIPs for the individual AIUs matching the search criteria.
CORE PRINCIPLES
Search Example
A company records phone calls and archives the recordings.
[Diagram: a SIP with a SIP descriptor and SIP data made up of phone call AIUs]
CORE PRINCIPLES
Search Example
A company records phone calls and archives the recordings. The SIP data holds the content information and the metadata:
    <Calls>
      <Call>...</Call>
      <Call>...</Call>
      <Call>...</Call>
      <Call>...</Call>
    </Calls>
CORE PRINCIPLES
Search Example
The min./max. values of selected elements of the SIP's structured data can be extracted during ingestion and assigned as attributes of the AIP repository object.
The first phase of the search consists of:
- Identifying the subset of the search criteria that relate to such partitioning criteria
- Searching for the archived AIPs whose min./max. value range satisfies those search conditions
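As an illustration of the min./max. extraction, here is a minimal Python sketch that scans the structured data once and returns the range of one element. The <CallDate> and <Callee> element names and the sample values are assumptions for this example; the function is not part of InfoArchive.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data for one SIP of phone calls.
CALLS_XML = """
<Calls>
  <Call><CallDate>2011-02-03</CallDate><Callee>Jack</Callee></Call>
  <Call><CallDate>2011-02-11</CallDate><Callee>John</Callee></Call>
  <Call><CallDate>2011-02-27</CallDate><Callee>Jill</Callee></Call>
</Calls>
"""


def partition_range(pdi_xml, element_name):
    """Return (min, max) of the text values of element_name.

    ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings, so no
    date parsing is needed. Raises ValueError if the element never occurs.
    """
    values = [el.text for el in ET.fromstring(pdi_xml).iter(element_name)]
    return min(values), max(values)


call_date_min, call_date_max = partition_range(CALLS_XML, "CallDate")
print(call_date_min, call_date_max)
```

The two values would then be stored on the AIP repository object, so that phase 1 of a search can rule the package in or out without opening its metadata.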
CORE PRINCIPLES
For the purposes of this example, an archive package contains all the phone calls for a given month, for example all the phone calls for the month of February.
One package may contain calls to different people (one call to Jack, two calls to Jill, etc.).
The AIP index field is the date on which the calls were made.
[Diagram: one package per month: 2011 January, 2011 February, 2011 March, ..., 2013 March]
CORE PRINCIPLES
Search Example
Now suppose we want to search for a given set of archived phone calls, specifically all phone calls made to John on February 11, 2011.
CORE PRINCIPLES
Two-Phase Search: Phase 1, searching for the right AIPs
- Find the relevant AIPs that may contain the records you are searching for
- The AIP index is queried and all the AIPs that have records for February 2011 are identified
- In this example there is only one such package
- In the next slide it is highlighted in green
CORE PRINCIPLES
Management of AIPs
AIP repository objects (Content Server):

Archive Holding   AIP ID   CallDate_min   CallDate_max   Cached in
PhoneCalls        1        2011-01-01     2011-01-31     No
PhoneCalls        2        2011-02-01     2011-02-28     Yes
PhoneCalls        3        2011-03-01     2011-03-31     No
PhoneCalls        ...      ...            ...            ...
PhoneCalls        9        2011-09-01     2011-09-30     Yes

Cache storage: data of AIP IDs 2 and 9.
Archive storage: data of AIP IDs 1 to 10. AIPs 2 & 9 are also stored in archive storage, so their partitions do not need to be included in the xdb backup.
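Phase 1 reduces to a range-overlap test against the min./max. attributes in this registry. The following Python sketch models a few registry rows in memory; the AipRecord class and candidate_aips function are hypothetical names, and in the product this lookup is a query against the repository rather than a list scan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AipRecord:
    holding: str
    aip_id: int
    call_date_min: date
    call_date_max: date
    cached_in: bool


# A few rows from the example registry.
REGISTRY = [
    AipRecord("PhoneCalls", 1, date(2011, 1, 1), date(2011, 1, 31), False),
    AipRecord("PhoneCalls", 2, date(2011, 2, 1), date(2011, 2, 28), True),
    AipRecord("PhoneCalls", 3, date(2011, 3, 1), date(2011, 3, 31), False),
]


def candidate_aips(registry, holding, search_from, search_to):
    """Phase 1: keep the AIPs whose [min, max] range overlaps the search range."""
    return [
        aip
        for aip in registry
        if aip.holding == holding
        and aip.call_date_min <= search_to
        and aip.call_date_max >= search_from
    ]


# All calls made on February 11, 2011: only the February package qualifies.
matches = candidate_aips(REGISTRY, "PhoneCalls", date(2011, 2, 11), date(2011, 2, 11))
print([aip.aip_id for aip in matches])
```

A wider range, such as mid-January to mid-February, would return AIPs 1 and 2, which is why search cost follows the search range rather than the total archive size.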
CORE PRINCIPLES
Determine whether the AIP is cached in
- Before searching the contents of an AIP, we have to make sure that its data partition is attached to the Data Service search engine
- Looking in the AIP registry, we see that the answer is Yes
- If the answer were No, we could not search this data synchronously and would have to issue an asynchronous search instead
CORE PRINCIPLES
Determine whether the AIP is cached in
[Table repeated from the "Management of AIPs" slide: the registry row for PhoneCalls AIP ID 2 (2011-02-01 to 2011-02-28) shows Cached in = Yes, and its data is present in cache storage as well as in archive storage.]
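The cached-in check can be read as a small routing decision between the two search modes. A minimal Python sketch, assuming the candidate AIPs from phase 1 are given as a mapping from AIP ID to the "Cached in" flag; the function name and return values are illustrative only.

```python
def choose_search_mode(cached_in_by_aip):
    """Pick the search mode for a set of candidate AIPs.

    cached_in_by_aip maps each candidate AIP id to its "Cached in" flag.
    A synchronous search is possible only if every candidate is cached in.
    """
    if all(cached_in_by_aip.values()):
        return "synchronous"
    return "asynchronous"


print(choose_search_mode({2: True}))            # February 2011 only
print(choose_search_mode({1: False, 2: True}))  # January and February 2011
```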
CORE PRINCIPLES
Two-Phase Search: Phase 2, searching for the right AIUs in the AIP
- The AIP consists of a set of AIUs, which are the phone call recordings
- We need to search the AIU metadata of these recordings to find all the calls that were made to John on Feb 11
- The good news is that we only have to search one AIP, the one that contains all the phone calls for February
- By filtering out the AIPs that cannot contain the AIUs we need, we ensure that, for a given search range, the search time is not influenced by the total volume of archived data. A wider range still costs more: querying 10 years of data will of course take longer than querying 6 months.
CORE PRINCIPLES
Search the AIUs
The index generated and stored in the xdb partition (here, the data of AIP ID 2) is now used to improve query performance when searching for individual AIUs.
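Phase 2 then filters the AIU metadata of the selected AIP. The sketch below is a plain-Python stand-in for the query that would run inside xdb; it does not use any xdb API or index, and the <CallDate> and <Callee> elements, the id attribute and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata of the February 2011 package (AIP ID 2).
AIP_2_METADATA = """
<Calls>
  <Call id="call-1"><CallDate>2011-02-03</CallDate><Callee>Jack</Callee></Call>
  <Call id="call-2"><CallDate>2011-02-11</CallDate><Callee>John</Callee></Call>
  <Call id="call-3"><CallDate>2011-02-11</CallDate><Callee>Jill</Callee></Call>
  <Call id="call-4"><CallDate>2011-02-27</CallDate><Callee>John</Callee></Call>
</Calls>
"""


def search_aius(pdi_xml, call_date, callee):
    """Phase 2: return the <Call> AIUs matching both criteria."""
    root = ET.fromstring(pdi_xml)
    return [
        call
        for call in root.findall("Call")
        if call.findtext("CallDate") == call_date
        and call.findtext("Callee") == callee
    ]


# All phone calls made to John on February 11, 2011.
hits = search_aius(AIP_2_METADATA, "2011-02-11", "John")
print([call.get("id") for call in hits])
```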
SCALABILITY
- Until now, sizing has been driven by the required batch ingestion throughput, not by the search & retrieval activity to serve
  - The key to increasing the global batch ingestion throughput is being able to run a higher number of concurrent ingestions
  - The search & retrieval workload is unlikely to significantly impact the global sizing
  - The ingestion workload profile is very different for structured data and unstructured data
- Performance is not sensitive to the volume already archived
  - Each batch ingestion works in a single xdb partition
  - A search that includes a partitioning criterion quickly narrows the XQuery scope to a subset of xdb partitions
SCALABILITY
Main stressed resources, by type of increasing workload:
- Batch ingestion throughput
  - For structured data: CPU of the ingestor and xdb tiers; I/O on the ingestion working area & xdb file system
  - For unstructured data: CPU of the ingestor and Content Server tiers; I/O on the ingestion working area & Content Server storage areas
- Searches: CPU & memory on the Web Services and xdb tiers; I/O on the xdb file system
- # of concurrent users: CPU & memory on the GUI tiers
- # of submitted background searches: CPU & memory on the Order processor and xdb tiers; I/O on the xdb file system
- Transactional ingestion throughput: CPU on the Content Server and Web services tiers; I/O on the Web services working area
SCALABILITY
Vertical scalability
- Several instances of each tier can be run on a given server
- Each tier can be isolated on distinct servers*
[Diagram: one VM per tier: Receiver, Ingestor and Order processor (concurrent executions); GUI app servers and Web Services app servers (each behind HTTP load balancing); xdb database#1 to xdb database#n with xdb cache; Content Server; RDBMS database]
- Session affinity is mandatory for load balancing across GUI instances
- All Web services are stateless (i.e. session affinity is not required for load balancing across WS instances)
- If needed, several distinct xdb databases can be used concurrently by the system
- The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well
* Without requiring any shared file system to be set up across tiers
SCALABILITY
Horizontal scalability
- Instances of each tier can be distributed among multiple servers
[Diagram: multiple VMs per tier: Receiver, Ingestor and Order processor (concurrent executions); GUI app servers and Web Services app servers (each behind HTTP load balancing); xdb database#1 with xdb cache#1 to xdb database#n with xdb cache#n; Content Server; RDBMS database]
- The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well
SCALABILITY
- Use batch ingestion with large SIPs instead of synchronous ingestion with small SIPs whenever possible
  - The associated workload is much lower and the ingestion is much faster
  - Batch ingestion can be scheduled during periods of low search/retrieval activity in order to optimize the usage of the platform
- Configuring distinct file system areas (local, distinct LUNs) for each ingestion node allows the I/Os to scale horizontally
- Host the xdb log area on a distinct file system area
RETENTION POLICY SERVICES
Objectives
- Delegate to RPS the management of the retention and/or the retention markup of the AIPs.
- The RPS retention policies can be inherited from the parent folder or applied directly to the AIP, either during reception or manually.
- The integration has been designed to:
  - Impose minimal constraints on the definition of the RPS policies.
  - Not impose additional complexity on customers who do not need RPS (i.e. who only have to apply the basic date-based retention provided by the existing built-in InfoArchive retention management).
General principles
- InfoArchive (EIA) will never decide to purge an AIP if at least one RPS retention policy or RPS markup is applied to it.
- The RPS disposal event is trapped by an EIA TBO on the AIP destroy operation, which attaches the AIP to the EIA Purge lifecycle.
- This attachment is required in order to generate any EIA confirmation message that has been configured.
- After the confirmations are generated, the execution of the Purge job physically destroys the AIP.
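The first general principle can be stated as a guard in front of the built-in date-based retention. A minimal Python sketch, assuming the built-in check is a simple comparison of a retention date with today's date; the function and parameter names are hypothetical, and the actual EIA logic is not shown in this deck.

```python
from datetime import date


def eia_may_purge(retention_date, today, rps_policies=(), rps_markups=()):
    """Decide whether the built-in EIA retention may purge an AIP.

    Any applied RPS retention policy or RPS markup hands the decision over
    to RPS, so EIA never purges on its own in that case. Otherwise the
    (assumed) built-in rule applies: purge once the retention date is reached.
    """
    if rps_policies or rps_markups:
        return False
    return today >= retention_date


print(eia_may_purge(date(2020, 1, 1), date(2021, 6, 1)))
print(eia_may_purge(date(2020, 1, 1), date(2021, 6, 1), rps_policies=["policy-a"]))
```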
RETENTION POLICY SERVICES
Restrictions
- Batch ingestion only: RPS features cannot be used with synchronous ingestion.
- AIP only: the application of RPS retention policies is only supported for AIP repository objects (i.e. the eas_aip type or its sub-types).
- Destroy All: all RPS policies applied to AIPs must use a disposal strategy that includes the destruction of the repository object (i.e. "Export All, Destroy All" or "Destroy All"); other disposal strategies are not currently supported.
- All Renditions: RPS retention policies applied to AIPs must use the All Renditions strategy.
RETENTION POLICY SERVICES
Applied Retention Policies and Retention Markups
To know whether an AIP is under RPS control:
- Display the Retainer ID and Retain Content Until columns.
- If the Retainer ID is not empty, the AIP is linked to a Retention Policy and/or a Retention Markup.
- For more details, go to View > Applied Retention or View > Applied Retention Markup.
- You can see the current state (Active / Final).
RETENTION POLICY SERVICES
Apply a Retention Policy
Manually
- You can manually apply an Individual / Linked Retention Policy to an AIP / Folder by selecting the action menu Records > Apply Retention Policy.
- Don't forget to select the policy you want to apply at the top.
During the reception
- You can decide to apply an RPS retention policy and an EAS retention period.
- At the Holding level, you can decide to apply a default Retention Class.
- The retention class is defined in the table.
- You decide whether or not to apply an RPS retention policy and/or an EAS retention period.
- The default value can be overridden by a retention class provided in the SIP descriptor.
RETENTION POLICY SERVICES
Disposition
Date Based Retention
- When the Retain Until Date has expired, the retainer is eligible to be promoted to the Final state.
- The promotion is performed by the RPS job dmc_rps_promotionjob.
- The disposition is performed by the RPS job dmc_rps_dispositionjob.
- When an AIP is disposed of, the object is moved to the folder /System EAS/data/purge and attached to the Purge lifecycle.
- You need to complete the Purge process to delete the object.
Event Based Retention
- To be promoted to the Final state, all conditions must be satisfied.
- In the Applied Retention view, select an item and go to View > Properties > Info.
- In the Phases tab, edit the Condition and enter an event date.
- Run the Promotion and Disposition jobs to continue the process.
Privileged Delete
- You can force the disposition by using the action Records > Privileged Delete.
- The AIP is moved to the folder /System EAS/data/purge and attached to the Purge lifecycle.
- Privileged Delete is only possible on an AIP in the COM, REJ-DONE or INV-DONE state.
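The Privileged Delete rule can be summarized as a state check followed by a move to the purge folder. This Python sketch models an AIP as a plain dictionary purely for illustration: the dictionary keys and function name are hypothetical, while the three states and the folder path are the ones listed on the slide.

```python
# States in which Privileged Delete is allowed, as listed on the slide.
PRIVILEGED_DELETE_STATES = frozenset({"COM", "REJ-DONE", "INV-DONE"})
PURGE_FOLDER = "/System EAS/data/purge"


def privileged_delete(aip):
    """Return a copy of the AIP as it looks after a Privileged Delete.

    The AIP is moved to the purge folder and attached to the Purge
    lifecycle; the Purge process must still run to delete the object.
    Raises ValueError if the AIP is not in an allowed state.
    """
    if aip["state"] not in PRIVILEGED_DELETE_STATES:
        raise ValueError(f"Privileged Delete not allowed in state {aip['state']!r}")
    return {**aip, "folder": PURGE_FOLDER, "lifecycle": "Purge"}


print(privileged_delete({"aip_id": 2, "state": "COM", "folder": "/hypothetical/holding"}))
```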
INFOARCHIVE
QUESTIONS