Technical brief

Introduction

Disaster recovery (DR) is the science of returning a system to operating status after a site-wide disaster. DR enables business continuity for significant data center failures that high availability features cannot cover.

Computer systems generally support DR in two ways: backups and replication. Backups entail full or partial copies of data from the master cluster that are stored on separate media. Replication, also known as mirroring, continuously copies data from the master cluster to a geographically remote instance of the system (replicas or mirrors). For production deployments, mirroring is the preferred strategy for DR.

With either method, a copy of the data is available to restore and thus recover from the disaster. DR with backups involves restoring the saved data into an alternate cluster and enabling that cluster as the new master. DR with mirroring entails activating the mirror, which already has the data loaded, as the new master cluster. (Note that replication is also used to refer to the copying of data within a cluster in a data center to eliminate single points of failure and enable high availability.)

In a related area, some systems support point-in-time snapshots, also known as checkpoints, to allow rolling data back to a prior state. This feature is generally used to recover from data corruption due to application or user error. For more information, please see the MapR Snapshots tech brief.

DR requires planning to determine two objectives. The recovery point objective (RPO) is a planned estimate of how much data the organization can afford to lose in case of a disaster. In other words, this is a measure of the level of potential data loss. The recovery time objective (RTO) is the amount of time the organization can be on hold while the system is being recovered. This is a measure of potential downtime.
These two objectives indicate that DR is a sliding scale, so organizations must plan how much cost and effort should be applied to limit data loss. Lower RPO and RTO values provide greater protection against data loss and downtime, but they take more resources to implement. Backups tend to be the much cheaper option, but consequently result in both high RPO and high RTO. Mirroring is more expensive due to the redundant hardware in the remote mirrors, but carries a lower risk of data loss.

RPO represents the data differential between the source cluster and the replicas.

RTO represents the time it takes to recover a system after a disaster occurs.
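
The trade-off between backups and mirroring can be made concrete with some simple arithmetic. The following is a minimal sketch, not a MapR tool: the `DrPlan` class, its fields, and all of the numbers are hypothetical, and the worst-case loss window is modeled under the stated assumption that it equals one copy interval plus one transfer time.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    """Hypothetical description of a DR setup; every field is illustrative."""
    name: str
    copy_interval_min: float  # minutes between backup or mirror runs
    transfer_min: float       # minutes for one run to reach the remote site
    recovery_min: float       # minutes to restore or activate the copy and redirect users


def worst_case_rpo_min(plan: DrPlan) -> float:
    # Assumption: data written just after a run starts is unprotected until
    # the next run completes, so the window is one interval plus one transfer.
    return plan.copy_interval_min + plan.transfer_min


def meets_objectives(plan: DrPlan, rpo_min: float, rto_min: float) -> bool:
    # The plan is acceptable only if both the loss window and the downtime fit.
    return worst_case_rpo_min(plan) <= rpo_min and plan.recovery_min <= rto_min


# Made-up figures for a nightly backup and a mirror scheduled every 15 minutes.
nightly_backup = DrPlan("nightly backup", 24 * 60, 120, 8 * 60)
frequent_mirror = DrPlan("15-minute mirror", 15, 5, 30)

for plan in (nightly_backup, frequent_mirror):
    ok = meets_objectives(plan, rpo_min=60, rto_min=60)
    print(f"{plan.name}: worst-case RPO {worst_case_rpo_min(plan)} min, meets objectives: {ok}")
```

With these illustrative numbers, only the frequently scheduled mirror fits a one-hour RPO and a one-hour RTO; real values depend on data volume, network capacity, and recovery procedures.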

In the MapR Distribution

The MapR Distribution including Apache Hadoop includes backup and mirroring capabilities to protect against data loss after a site-wide disaster. MapR is the only distribution that provides built-in, enterprise-grade DR for Hadoop. MapR was built to address real-world DR scenarios where lost data and downtime result in lost revenue, lost productivity, and/or failed opportunities.

To create backups, administrators first take a snapshot of the MapR cluster at the volume level. The snapshot includes all data in the volume, including both files and MapR-DB database tables. The snapshot completes in a few seconds and represents a consistent view of the data. This means that, unlike in other Hadoop distributions, the state of the snapshot will always be the same. The snapshot can then be written to another medium as a backup. In other Hadoop distributions, snapshots might change over time, depending on the state of open files when the snapshot was taken. Also, in those distributions, partially written files won't be captured when the snapshot is taken, making it difficult to create an accurate backup.

To create remote replicas, MapR provides two features that enable DR for different use cases: Mirroring and Table Replication.

MapR Mirroring is used to create remote mirrors of files. Mirroring has the following characteristics that are critical for proper DR deployments:

Scheduled. Using the browser-based MapR Control System (MCS), administrators can schedule how often mirrors are updated. A higher frequency of updates leads to a lower RPO.

Incremental. Only deltas are transferred from the master cluster to the replicas. If only an 8K block is updated at the master cluster, then only that block will be transferred in the next mirroring job.

Efficient. Transferred data is compressed and sent asynchronously and in parallel, and the transfer does not significantly impact system performance.

Consistent. Prior to creating remote mirrors, a snapshot is automatically taken so that the remote mirror reflects a consistent, known state of the master. Checksums are run to ensure integrity.

Atomic. Changes on the mirror are made only after all data has been received for a given mirroring operation.

Flexible. Multiple mirroring topologies are supported, including cascaded and one-to-many mirroring.

Resilient. Should there be a network partition during a mirroring operation, the system periodically retries the connection and resumes once the network is restored.

Secure. Configurable over-the-wire encryption prevents network eavesdropping on the mirrored data.

Table Replication is the (near) real-time mechanism for replicating data in MapR-DB database tables. Since database updates tend to occur much more frequently, rapidly, and granularly than file updates, this feature is required to minimize the differential between the master data and the replicas. Table Replication has the following advantages:

Immediate. Every database update at the master cluster is immediately transferred to the remote replica. This enables a very low RPO.

Efficient. Transferred data is compressed and sent asynchronously and in parallel, and the transfer does not significantly impact system performance.

Multi-master. For global deployments that share common data, multi-master support lets geographically dispersed user groups perform both reads and writes on the data, and all distributed replicas will be synchronized.

Resilient. Should there be a network partition during a replication operation, the system periodically retries the connection and resumes once the network is restored.

Secure. Configurable over-the-wire encryption prevents network eavesdropping on the replicated data.

MapR Implementation

Once you've determined your DR strategy, and thus your RPO and RTO requirements, you can leverage MapR features to support that strategy. Assuming you have a business-critical environment, this discussion skips the backup option and instead focuses on Mirroring and Table Replication. In most big data deployments, especially on MapR, a combination of files and database tables will be used, so using both features will enable a robust DR implementation.

Achieving Low RPO with Scheduled Mirroring

For files in your MapR cluster, use Mirroring on a scheduled basis to ensure remote mirrors frequently get the latest updates. The window of potential data loss depends on how frequently your mirroring operations are scheduled.

Scheduled mirroring keeps production sites synchronized with DR sites (replicas).

For an extra level of DR protection, such as to guard against multiple data center failures, using different mirroring topologies, including a cascaded mirror chain, will create multiple remote copies. Cascaded mirror chains are also useful for creating more efficient delivery of mirror updates.
For example, if your master cluster is in New York, and you want to mirror to Sydney and Singapore, it would make sense to mirror from New York to Sydney, and then have a separate mirror chain from Sydney to Singapore.
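
To illustrate how a cascaded chain such as New York to Sydney to Singapore behaves, the sketch below walks a mirror topology and estimates how far behind the master each site can fall. It is a simplified model, not MapR functionality: the `MIRROR_EDGES` structure, the site names as identifiers, and the 30-minute schedules are hypothetical, and the model assumes each hop can add up to one full schedule interval while ignoring transfer time.

```python
from collections import deque

# Hypothetical topology: source site -> list of (target site, minutes between mirror runs).
MIRROR_EDGES = {
    "new-york": [("sydney", 30)],
    "sydney": [("singapore", 30)],
}


def worst_case_lag(edges, master):
    """Return {site: minutes} of worst-case staleness relative to the master.

    Assumption: each hop adds at most one full schedule interval, and
    transfer time is ignored.
    """
    lag = {master: 0}
    queue = deque([master])
    while queue:
        site = queue.popleft()
        for target, interval in edges.get(site, []):
            if target not in lag:  # in this sketch each mirror has exactly one source
                lag[target] = lag[site] + interval
                queue.append(target)
    return lag


lag = worst_case_lag(MIRROR_EDGES, "new-york")
for site, minutes in lag.items():
    print(f"{site}: up to {minutes} min behind the master")
```

Under these assumptions a cascade trades some additional lag at downstream sites for fewer transfers out of the master cluster, which is worth weighing against the RPO required at each site.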

Mirroring can be chained from the source site to another and then to another.

Achieving Low RPO with Table Replication

With database tables, you automatically get low RPO since Table Replication continuously transfers all database updates to the remote clusters. This ensures that the master database and replica databases are closely synchronized. The window of potential data loss is never more than a few seconds.

Achieving Low RTO with MapR Promotable Mirrors

MapR remote mirrors are initially read-only to prevent inadvertent writes to the replica that would result in inconsistency between master and mirror. But should a disaster occur, the mirror needs to be enabled as the (temporary) master cluster. The Promotable Mirrors feature lets you quickly activate (or promote) a mirror into a read/write state, thus enabling it for use as the new master cluster. This means that the bulk of the RTO will consist of redirecting users at the network or application level to the new master cluster.

Live data resides at the DR site, despite a disaster at the production site.
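
The read-only-until-promoted behavior described above can be pictured with a toy model. This is a sketch for illustration only and does not use or represent the MapR API: the `MirrorVolume` class, the `fail_over` helper, and the `routing` dictionary are all hypothetical names, and the rule that a promoted volume stops accepting mirror updates is an assumption of the sketch.

```python
class MirrorVolume:
    """Toy model of a remote mirror; not the MapR API."""

    def __init__(self, name):
        self.name = name
        self.read_write = False  # mirrors start out read-only
        self.files = {}

    def apply_mirror_update(self, files):
        # Assumption of this sketch: once promoted, the volume is a master
        # and no longer accepts updates from the old source.
        if self.read_write:
            raise RuntimeError("promoted volumes do not accept mirror updates")
        self.files = dict(files)  # swap in the complete new state in one step

    def write(self, path, data):
        # Client writes are rejected until the mirror has been promoted.
        if not self.read_write:
            raise PermissionError(f"{self.name} is a read-only mirror")
        self.files[path] = data

    def promote(self):
        self.read_write = True


def fail_over(mirror, routing):
    """Promote the mirror, then point clients at it (the bulk of the RTO)."""
    mirror.promote()
    routing["active"] = mirror.name
    return routing


dr = MirrorVolume("dr-site")
dr.apply_mirror_update({"/data/a": b"v1"})
routing = {"active": "production"}

try:
    dr.write("/data/b", b"x")  # rejected while the mirror is read-only
except PermissionError as err:
    print("blocked:", err)

fail_over(dr, routing)
dr.write("/data/b", b"x")  # accepted after promotion
print(routing["active"], sorted(dr.files))
```

In this model the promotion itself is a single state change, so most of the recovery time goes to the redirection step, which matches the point made in the text.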

The DR site is promoted to be the new production site, to which users are redirected.

Achieving Low RTO with Table Replication

Since Table Replication ensures tight synchronization between the master database tables and replica tables, and those replica tables are already read/write enabled, no additional effort is required to activate a replica as the master. This means that, as above, the bulk of the RTO will consist of redirecting users at the network or application level to the new master cluster.

Conclusion

When running a production deployment on Apache Hadoop, some of the same business continuity practices that you've applied in your existing data architecture must be used. Should you face a site-wide disaster, you want to make sure you have a strategy in place to minimize data loss and downtime. With the MapR Distribution, you get the enterprise-grade disaster recovery capabilities that you would expect from any production-grade software system. MapR lets you define low recovery point objectives and recovery time objectives to meet your business requirements, while also minimizing the administrative overhead to achieve those objectives.

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified distribution for Hadoop. MapR is used by more than 700 customers across ad media, consumer products, financial services, government, healthcare, manufacturing, market research, networking and computers, retail/online and telecommunications as well as by leading Global 2000 and Web 2.0 companies. Amazon, Cisco, Google, Teradata and HP are part of the broad MapR partner ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm Ventures and Redpoint Ventures. MapR is based in San Jose, CA.

© 2015 MapR Technologies.