TRUSTED ARCHIVE OVERVIEW
ARCHIVE RECONCILIATION AT MASSIVE SCALE USING MERKLE TREES

Enterprise communications have experienced explosive growth driven by the continuous shift to digital workflows and the pervasive use of social media, mobile and web channels for customer engagement. As legal authorities introduce increasingly complex record-keeping requirements, staying compliant across all communication channels continues to be a challenge for enterprises.

One of the main components of compliance is record keeping, the process of ensuring that data is accessible in an archive for future use. To preserve the integrity of archived data, a mechanism must be in place to guarantee that data is not corrupted or lost. This mechanism is reconciliation, i.e., the process of establishing consistency among data from a source to a target store and vice versa. Reconciliation ensures that a set of records is authentic and correct versus a golden copy, i.e., that they match in digital signatures (e.g., checksums, fingerprints) and in record counts.

The process must be efficient enough to handle the hundreds of terabytes of data and hundreds of millions of records typically found in a compliance archive. The challenge is to find or develop a mechanism that both detects discrepancies in large datasets in a space- and time-efficient way and offers a means of resolving such discrepancies in a timely manner.

The figure below describes the Bloomberg Vault reconciliation process. This process employs data sequencing and hash tree (Merkle tree) technology to efficiently compare and synchronize a range of data objects between client and cloud.
The Bloomberg Vault Reconciliation Process

[Figure: The Bloomberg Vault Reconciliation Process. Panels: Within Client, Within Cloud. Labels: Data, Archive, Compare Root Node, Compare Sub-Nodes, Retransmit Missing Data. Numbered callouts 1-7 correspond to the steps below.]

1. Transmit Data: Data local to the client is transmitted to the cloud-based archive, where it is ingested and stored.
2. Determine Range: The client selects from its local store a range of data, determined by a count or time threshold, to be reconciled with the cloud.
3. Create Sequence: The client uses the sequence embedded in the objects in the data range to bring the objects into a deterministic order.
4. Calculate Hash Tree: A hash signature is calculated for each object. The hash signatures are used as leaf nodes to generate a tree of hashes.
5. Retrieve Data: The cloud retrieves the objects for a given data range from its archive.
6. Determine Range, Use Sequence and Retrieve Data: The cloud performs the same sequencing and hash tree calculation tasks as described in steps 2-4.
7. The Reconciliation Process starts with a comparison of the root values of the hash trees generated for a given data range. A match for a data range digitally certifies that the corresponding dataset is the same between client and cloud. If the values don't match, the process recursively traverses the hash tree to find the corrupted objects, i.e., those for which the hash signatures don't match. These objects are retransmitted in a subsequent step.
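Steps 3, 4 and 7 can be sketched in Python as follows. This is a minimal illustration, not the Bloomberg Vault implementation: the choice of SHA-256, the binary tree shape, the 0x00/0x01 prefixes, the fixed marker hash for a missing record, and all function names are assumptions made for the example. Records are keyed by sequence number so that both sides produce trees of the same shape for a given range.

```python
import hashlib

def leaf_hash(payload):
    # Assumed convention: a 0x00 prefix keeps leaf hashes distinct from interior hashes.
    return hashlib.sha256(b"\x00" + payload).digest()

# Assumed marker used as the leaf for a sequence number that has no record.
MISSING_LEAF = hashlib.sha256(b"\x02").digest()

def build_hash_tree(records, first_seq, last_seq):
    """Return the tree as a list of levels: leaves first, root level last.

    records maps sequence number -> payload bytes for one side's store.
    """
    if first_seq > last_seq:
        raise ValueError("empty data range")
    # Step 3: walking the range in sequence order gives a deterministic leaf order.
    level = [leaf_hash(records[s]) if s in records else MISSING_LEAF
             for s in range(first_seq, last_seq + 1)]
    levels = [level]
    # Step 4: hash pairs of nodes upward until a single root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            left = level[i]
            right = level[i + 1] if i + 1 < len(level) else left  # odd node pairs with itself
            parents.append(hashlib.sha256(b"\x01" + left + right).digest())
        level = parents
        levels.append(level)
    return levels

def root_hash(levels):
    return levels[-1][0]

def find_mismatched_sequences(client, cloud, first_seq):
    """Step 7: descend only into subtrees whose hashes differ."""
    if len(client[0]) != len(cloud[0]):
        raise ValueError("trees cover different ranges")
    bad = []

    def walk(lvl, idx):
        if client[lvl][idx] == cloud[lvl][idx]:
            return  # identical subtree, nothing below needs checking
        if lvl == 0:
            bad.append(first_seq + idx)
            return
        for child in (2 * idx, 2 * idx + 1):
            if child < len(client[lvl - 1]):
                walk(lvl - 1, child)

    walk(len(client) - 1, 0)
    return bad

# Hypothetical data: the cloud copy is missing record 3 and holds a corrupted record 6.
client_store = {seq: f"message {seq}".encode() for seq in range(1, 9)}
cloud_store = dict(client_store)
del cloud_store[3]
cloud_store[6] = b"corrupted"

client_tree = build_hash_tree(client_store, 1, 8)
cloud_tree = build_hash_tree(cloud_store, 1, 8)
to_retransmit = find_mismatched_sequences(client_tree, cloud_tree, 1)
```

With the hypothetical data above, the root hashes differ and `to_retransmit` is `[3, 6]`. Subtrees whose hashes agree are never visited, which is what keeps the comparison cheap when only a few objects differ.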
Bloomberg Vault reconciliation uses an efficient protocol between a client and the cloud-based archive to execute the reconciliation process. The protocol is implemented by a client component that runs as part of an enterprise communication or collaboration service and by a hosted service offered as part of Bloomberg's cloud.

Clients that participate in the reconciliation process maintain a copy of all data that needs to be reconciled. This local copy can be a file share, a mail server inbox or a local store of an SMTP server; it is often called the golden copy. The size of the golden copy depends on the resiliency requirements of the customer, but once the data has been reconciled, it can be destroyed and the space reused.

The example below depicts the implementation and protocol using Microsoft Exchange Server and a plugin. The Reconciliation Client Plugin in Exchange maintains the hash values and sequence numbers for all email communications that are transmitted to the cloud for archiving purposes. At the end of a configurable time period (e.g., end of day), the Plugin starts the reconciliation process for the range of data transmitted. The first step is to sequence the hash values for each of the individual emails in the data range to be reconciled. After all emails have been sequenced, digital signatures and the hash tree are generated, and the Plugin sends a Start Reconciliation message to the cloud service.

Once the cloud service receives the Start Reconciliation message, it retrieves all email messages that have been ingested, stored and indexed on behalf of the client for the given data range. It performs the same sequencing and hash tree generation as the Plugin and notifies the Plugin when it is ready to reconcile that data range. After the Start Reconciliation steps are completed, the Client Plugin and Hosted Service exchange hash tree nodes within a reconciliation process that efficiently compares data between the two systems.
The process either digitally certifies that the transmission was correct (the root hash nodes match) or progresses until the corrupted or missing email messages are identified.

[Figure: Implementation and Protocol Example. Panels: Microsoft Exchange Server (Reconciliation Client Plugin) and Bloomberg Vault (Reconciliation Service). Component labels: Protocol Engine, Reconciliation Engine, Hash Tree Generator, Sequence Generator, Range Generator, Local Store, Admin, Ingestion, Archiving, Storage, Data Management, Search, Reporting, Analytics. Reconciliation Protocol labels: Start Reconciliation, Data Range, Get Node Hash, Node Hash Value, Stop Reconciliation, Purge Records or Retransmit.]
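The message exchange named in the figure (Start Reconciliation, Get Node Hash, Node Hash Value, Stop Reconciliation) can be sketched as an in-process simulation. This is a hedged illustration only: the class and method names, the SHA-256 hash, the tree layout and the marker used for a missing record are all assumptions, and a real deployment would carry these messages over a network rather than through direct method calls. The point of the sketch is that the plugin requests only the node hashes it needs, so a single missing message is located with a few requests per tree level.

```python
import hashlib

MISSING_LEAF = hashlib.sha256(b"\x02").digest()  # assumed marker for an absent record

def build_levels(records, first_seq, last_seq):
    """Hash tree as a list of levels (leaves first); records maps sequence -> bytes."""
    level = [hashlib.sha256(b"\x00" + records[s]).digest() if s in records else MISSING_LEAF
             for s in range(first_seq, last_seq + 1)]
    levels = [level]
    while len(level) > 1:
        level = [hashlib.sha256(b"\x01" + level[i] + level[min(i + 1, len(level) - 1)]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

class ReconciliationService:
    """Hypothetical stand-in for the hosted service; no networking involved."""

    def __init__(self, archive):
        self.archive = archive  # sequence number -> archived payload
        self.levels = None

    def start_reconciliation(self, first_seq, last_seq):
        # Retrieve the range from the archive and build the same tree as the client.
        self.levels = build_levels(self.archive, first_seq, last_seq)

    def get_node_hash(self, level, index):
        # Reply corresponds to the "Node Hash Value" message.
        return self.levels[level][index]

    def stop_reconciliation(self):
        self.levels = None

def reconcile(local_store, service, first_seq, last_seq):
    """Client-side loop; returns (action, bad_sequence_numbers, requests_sent)."""
    mine = build_levels(local_store, first_seq, last_seq)
    service.start_reconciliation(first_seq, last_seq)
    requests, bad = 0, []
    stack = [(len(mine) - 1, 0)]  # start at the root node
    while stack:
        lvl, idx = stack.pop()
        requests += 1
        if service.get_node_hash(lvl, idx) == mine[lvl][idx]:
            continue  # subtree matches, no need to look further down
        if lvl == 0:
            bad.append(first_seq + idx)
            continue
        # Push the right child first so the left child is examined first.
        for child in (2 * idx + 1, 2 * idx):
            if child < len(mine[lvl - 1]):
                stack.append((lvl - 1, child))
    service.stop_reconciliation()
    # On success the local copy may be purged; otherwise the bad records are resent.
    return ("retransmit" if bad else "purge"), bad, requests

# Hypothetical run: 1,000 messages, one of which never reached the archive.
local = {seq: f"email {seq}".encode() for seq in range(1, 1001)}
archive = {seq: body for seq, body in local.items() if seq != 417}
action, bad, requests = reconcile(local, ReconciliationService(archive), 1, 1000)
```

In this hypothetical run the plugin identifies sequence number 417 after a few dozen node requests rather than comparing all 1,000 messages. When the two stores agree, only the root hash is requested and the result is a purge decision.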
Based on its configuration, the process either alerts the operator to the error condition or starts a retransmission of the missing or corrupted data. At the end of a successful reconciliation process, the client Plugin can safely remove all data stored locally.

We have used the Trusted Archive technology to secure the ingestion of hundreds of terabytes of client data; it has proven invaluable in maintaining our 24x7 operation with automatic end-to-end reconciliation and in helping our customers stay compliant. We can build a hash tree at a rate of about 2,840 documents/second on each side (enterprise and cloud); the actual tree comparison across a private line takes less than 10 seconds, including wire delay. For a typical dataset (500,000 messages across 2 machines at 3,000 messages/second), the reconciliation process takes under 100 seconds. This compares favorably with the typical process employed in the industry, i.e., to transfer counts, IDs and fingerprints to the archive and loop over the stored objects to compare them, a process that, at best, takes days to complete and involves manual steps.

TAKE THE NEXT STEP

Learn more about Bloomberg Vault and archive reconciliation at massive scale using Merkle trees. Contact us at +1 212 617 6580, vaultsales@bloomberg.net or call your regional representative to schedule a personalized demonstration.

BEIJING +86 10 6649 7500
DUBAI +971 4 364 1000
FRANKFURT +49 69 9204 1210
HONG KONG +852 2977 6000
LONDON +44 20 7330 7500
MUMBAI +91 22 6120 3600
NEW YORK +1 212 318 2000
SAN FRANCISCO +1 415 912 2960
SÃO PAULO +55 11 2395 9000
SINGAPORE +65 6212 1000
SYDNEY +61 2 9777 8600
TOKYO +81 3 3201 8900

bloomberg.com/vault

The data included in these materials are for illustrative purposes only.
© 2014 Bloomberg L.P. All rights reserved. S535654726 1214 DIG