Web Archiving for ediscovery

RELIABLE AND DEFENSIBLE WEB ARCHIVING

Web and social media archiving services for ediscovery

Right now, regulatory agencies are refining how they define web content and how it should be captured for compliance audits. In e-discovery, courts expect website and social media content to be treated like any other electronically stored information (ESI), preferably in native format. As companies receive more of these requests, web archiving is now essential to the protection and preservation of your business.

The number of e-discovery requests and their costs have risen sharply over the past few years, and this trend is expected to continue. Regulatory agencies, legal proceedings, and compliance mandates are setting the baseline for future governance of consumer-facing website and social media content.

This is the landscape of e-discovery. Auditors and courts now view web and social media content in the same context as other ESI, such as emails, file documents, contracts, and instant messages. The challenge is that websites, online media, and social media are technically very different from those formats, which makes their capture and preservation much more complicated.

The solution? Native-format web archiving.

United States
Hanzo Archives, Inc.
250 Columbus Avenue, Suite 204
San Francisco, CA 94133
www.hanzoarchives.com
contact@hanzoarchives.com
+1 415 683 7667

Europe
Hanzo Archives Limited
Heathrow Business Centre, 65 High Street
Egham, Surrey, TW20 9EY
www.hanzoarchives.com
contact@hanzoarchives.com
+44 020 3514 1322
Registered in England 5410483; VAT GB 912 8708 19

Copyright 2013 Hanzo Archives. All Rights Reserved.

Why To Archive

During the normal life cycle of a corporation's website and social media communications, web content is constantly updated, moved, and redesigned through personnel involvement, data updates, software upgrades, and browser evolution. This makes it a significant challenge to keep dynamic content legally compliant and to prove that it is authentic and correct when submitting data during the e-discovery process.

Producing websites and social media communications in court requires more than printouts, backups, or system restorations. Your electronic information must be presented in an authentic way, as an original, contextual experience. Depending on the capture, file formats, storage, and access methods employed in your archiving policy, requesting parties can view and replay websites and blogs, social media conversations, form-based customer interactions, Flash presentations, and membership-only websites: essentially all archived, historical data and behavior.

Case in point: in 2017, your company faces litigation regarding consumer-facing product or service content that was published across your company's global network of websites and social media accounts in March 2011. The e-discovery requirements stipulate that the data must be presented contextually and in native format (a requirement that is increasingly common today; imagine what will be required in 2017).

Without native-format web archiving, your legal team must engage in reactive archiving: gathering backed-up data, converting it into usable formats, and so on. This is costly, raises proof-of-authenticity issues, and (depending on the size of your web presence) could mean sorting through and organizing vast quantities of data.

With native-format web archiving, your legal team can pull up your web archives on demand, in a web browser of your choice, and present your web presence in the same contextual experience that users had when they saw your websites and social media accounts at a specific time.
This information is available in the cloud or on premises, in native format and, if necessary, in derivative formats such as PDFs.

What To Capture

[Figure: The Document Life Cycle over time: Active Use, Archive, Deletion]

In a word: everything. The reach of most corporations across the Internet is vast. Information for customers, suppliers, and regulators is published in multimedia websites, extranets, intranets, blogs, wikis, customer forums, online marketing campaigns, and social media accounts, and the list continues to expand.

This puts the onus on your business for every bit of information and communication that takes place online under your company name:

- Websites and all web content
- Social media accounts
- Online marketing collateral

Solutions and Options For Capture

As the web evolves, so must web archiving. It now offers a choice of approach to both archiving policy and supporting technology. These choices should be weighed carefully against business objectives before a buying decision is made. The main differences lie in the capture and access methods used.

Content Capture

Broadly, three different methods exist to capture and archive web content:

1. Client-side archiving
2. Transaction archiving
3. Server-side archiving

Client-side Archiving

- Uses an archival crawler, originally derived from search engine crawlers. More modern crawlers add significant enhancements, such as advanced content manipulation and other innovative approaches, to ensure that complex and hard-to-reach content can be found, captured, and stored without change.
- Starts from seed URLs or entry points and captures pages, parsing them to extract all links. The process repeats and continues as long as newly discovered pages remain within the scope defined for the crawl.
- Embedded files are stored unchanged and preserved in a structured, standards-based, self-contained file format designed specifically to preserve web content. These files can be confidently considered future-proof.

NOTE: To be effective, this method requires a crawler that has excellent link extraction and pathfinding algorithms, is content aware, and operates across a wide range of circumstances and site and page designs.

Transaction Archiving

- Consists of the systematic capture and archiving of all browser/server exchanges (request/response pairs) resulting from the interaction of users with sites, regardless of their content type and how they are produced (this includes within the cloud and on mobile devices).
- Enables every instantiation of content, whether in an HTML form or a database record, to be tracked, recorded, and preserved over time.
- Archives hidden web content read by the website's users during the capture period.
- Unfortunately, transaction archiving alone can result in partial or confusing captures of the content viewed, and can lack overall context.

NOTE: Transaction archiving generates unnecessary duplicates of frequently visited pages and raises serious privacy concerns, as the method implicitly relies on usage tracking. It is frequently used for e-commerce and other transactional content.

Server-side Archiving

- Copies files directly from the document folders to backup servers.
- Relies on all original CMSs, databases, and other software being archived with the content and/or actively maintained in an operational state; these may need to be migrated to newer CMSs, databases, etc., during the entire period of archive retention.
- It may be straightforward to back up your websites, although not necessarily for cloud-based services. Even so, restoring is less simple and potentially expensive, because over time it becomes increasingly difficult to match software versions with the appropriate backups and machines to recreate the original content. This is why backups are suited to business continuity over relatively short timescales, and why using backups for e-discovery can be an expensive process. Proper web archives, on the other hand, are optimized for e-discovery.
- Remains useful mainly in situations where it is necessary to archive parts of websites that a client-side crawler cannot reach.

NOTE: IT backups rely on server-side archiving in almost all cases, systematically failing to meet legal and compliance requirements.
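The client-side crawl loop described above (start from seed URLs, capture each page unchanged, extract its links, and continue while new pages stay in scope) can be sketched roughly as follows. This is a minimal illustration, not Hanzo's crawler: the names `LinkExtractor`, `in_scope`, and `crawl`, the host-based scope rule, and the injected `fetch` callable are all assumptions made for the example, and a production crawler would also need to handle scripts, forms, and politeness rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href and src attribute values found in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def in_scope(url, allowed_hosts):
    """Assumed scope rule: a URL is in scope if its host is in the allowed set."""
    return urlparse(url).netloc in allowed_hosts


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Breadth-first capture starting from seed URLs.

    `fetch(url)` is supplied by the caller and returns the response body
    as bytes, or None if the URL cannot be retrieved.
    """
    queue = deque(seeds)
    seen = set(seeds)
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        captured[url] = body  # stored exactly as received
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in seen and in_scope(absolute, allowed_hosts):
                seen.add(absolute)
                queue.append(absolute)
    return captured
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; in practice it would wrap a real client and the captured bodies would be written to a standards-based archive format rather than held in a dictionary.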
Comparison of Content Capture Methods

The main content capture methods are further summarized in the following table, where SS = Server-Side Archiving, Tx = Transaction Archiving, and CS = Client-Side Archiving. Each method is rated against each criterion as either fully supported or possible with custom development.

[Table: criteria rated for SS, Tx, and CS]

- Content captured as user sees it, unchanged, and authentic
- Archive access independent of original publishing technology
- Able to capture interactive or query-based content
- Retains web URL space (not dependent on server link mapping)
- De-duplication possible
- Easily directed and scheduled capture
- Flexible archival scope, for a wide range of needs
- Able to capture browser/server exchanges (request/response pairs)
- Web server technology independence
- Archiving services can be provided in one place
- Cost-effective and efficient operations over time

[Figure: Select, Archive, Authorised user, View]

Client-side archiving in native format enables the broadest support for capture of modern web content and social media, with on-demand access to archived content in a range of forms. Hanzo Archives specializes in client-side native-format archiving. With a superior modern crawler, Hanzo is able to capture most web and social media content. Plugins are available that enable server-side and transaction archiving to be deployed alongside it where circumstances require.

Instant Native Access

The best feature of your web archives is their ease of use. Date-stamped and forensically sound, your web and social media archives are available on demand in your browser, or can be delivered to your enterprise archive. Your archives don't depend on any specific OS, browser, or software version; they're self-contained and immune to obsolescence. This makes it easy to view, browse, and share your archived data on demand.

- Use a web browser to review and play back your web archives in their native format.
- Engage with the original, contextual experience of your rich and dynamic website and social media content as presented on the exact date and time of capture.
- Determine specific levels of sharing or access for third parties.

Web Archiving For ediscovery

In most cases client-side archiving is the best approach for capturing content for e-discovery, information management, and cultural heritage preservation. The quality of the resulting archive depends mainly on the capabilities of the crawler, particularly its link extraction, including when links are encoded in scripts and executables. This is one of the key determinants for capturing all files in a consistent and timely manner.

Archiving web-based documents for legal and regulatory compliance requirements is technically challenging and can be costly and time-consuming to implement. Hanzo's understanding of these challenges, coupled with its unique technology, enables corporations to address their web archiving problems today, with the confidence that should they need to access old content, it will be quick and straightforward.

Hanzo's continuous innovation in web archiving techniques and technology ensures its customers always have access to the most effective, innovative methods of meeting regulatory and legal requirements.

About Our Archiving

Technical Overview

Hanzo Archives' enterprise web archiving and social media archiving software collects websites and social media in native format. It preserves them in a forensically sound archive that makes all archived content available on demand. Optimised for web content in its native form, our solutions are the most comprehensive in the industry and are compliant with regulatory and litigation needs.

Hanzo crawlers gather web content and social media, either on demand or on a predetermined schedule. This content is ingested into the web archive.
[Figure: Hanzo Website Capture: Schedule, Target Site, WARC File. The example WARC record is annotated with Metadata, Hash, Headers, and Native format content.]

    WARC/1.0
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:7004e5fd-3f87-f10f-0539-e116845021fe>
    WARC-Date: 2011-11-01T18:12:33Z
    WARC-Target-URI: http://news.bbc.co.uk/
    WARC-Concurrent-To: <urn:uuid:58c7c6ab-458e-008acda5-41ccda1ce188>
    X-Hanzo-Page-Id: <bbc.warc>
    X-Hanzo-Page-Uri: http://news.bbc.co.uk
    X-Remote-Host: 212.58.246.82:80
    X-Hanzo-Record-Id: 875a0395-c82e-46b4-a251-0f910535a344
    WARC-IP-Address: 212.58.246.82
    Content-Type: application/http;msgtype=response
    Content-Length: 939
    WARC-Block-Digest: sha256:1c5006cbd371f39a25f3bbae6ee30c853b6961dad477673046831bf9bb69579f

    HTTP/1.1 301 Moved Permanently
    Date: Tue, 01 Nov 2011 18:12:33 GMT
    Server: Apache
    Set-Cookie: BBC-UID=44ae0be013563901a7a39e9291faeaa9f756127670a0c1ffd299c6b4a41f72f60Mozilla%2f5%2e0%20%28Macintosh%3b%20U%3b%20Intel%20Mac%20OS%20X%2010%5f6%5f5%3b%20de%2dde%29%20AppleWebKit%2f534%2e15%2b%20%28KHTML%2c%20like
    Location: http://www.bbc.co.uk/news/
    Cache-Control: max-age=0
    Expires: Tue, 01 Nov 2011 18:12:33 GMT
    X-Original-Content-Length: 234
    Keep-Alive: timeout=5, max=693
    X-Original-Connection: Keep-Alive
    Content-Type: text/html; charset=iso-8859-1
    Content-Length: 234

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="http://www.bbc.co.uk/news/">here</a>.</p>
    </body></html>

Each page, with its metadata, embedded content, attached documents, and media items, is collected and made available on demand in the archive access tool. It can also be exported in native format or in derived formats, such as PDFs, while retaining the original evidence. Users are able to browse or search the archive content within their web browser.

In addition, each page in the archive, along with its metadata, embedded content, attached documents, and media items, can be passed to an enterprise archive through Hanzo's Web Archive Connector. You are then able to discover and examine archived web content and social media in your preferred enterprise archive and e-discovery tools.

Let's break this down a little bit.

1. Collect web content and social media in native format, create alternative renderings such as PDFs, and ingest into the web archive
2. Extract meaningful information, such as metadata, authors, and commenters, as well as forensic information, and ingest into the web archive
3. Select (optionally) a third-party ingest wrapper or data interchange package in XML or other formats using Web Archive Connector
4. Send the package to a third-party enterprise archive and ingest
5. Discover and examine archived web content and social media alongside your email and other enterprise ESI

[Figure: workflow from the Hanzo Archives Crawler (1. collect web content and social media in native format and create renderings) to the Hanzo Archives Enterprise Web Archive (2. extract meaningful information and ingest), through the Web Archive Connector (3. create data interchange package), into a third-party Enterprise Archive or Discovery Tool (4. ingest), where archived material can be seen in native format via an optional link-back and examined alongside your other ESI.]
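Step 2 above (extracting metadata and forensic information) can be illustrated on a single WARC record like the example shown earlier. The sketch below splits a record into its version line, named headers, and content block, then recomputes the block digest. It is a simplified illustration under stated assumptions: the function names `parse_warc_record` and `block_digest_matches` are hypothetical, the digest is assumed to be written as `sha256:` followed by a hexadecimal value as in the example record, and repeated headers are not handled (a later value overwrites an earlier one).

```python
import hashlib


def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block).

    Assumes CRLF line endings and a single record in `raw`.
    """
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URIs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    block = rest[:length]  # the captured HTTP response, byte for byte
    return version, headers, block


def block_digest_matches(headers, block: bytes) -> bool:
    """Recompute the block digest and compare it with the recorded value."""
    algorithm, _, expected = headers["WARC-Block-Digest"].partition(":")
    if algorithm.lower() != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm}")
    return hashlib.sha256(block).hexdigest() == expected.lower()
```

A record whose recomputed digest matches the stored `WARC-Block-Digest` has not been altered since capture, which is the kind of check that underpins the forensic soundness claims made for the archive. Other tools may encode digests differently, so the comparison would need adjusting for them.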

Features and Benefits

Hanzo's web archiving technology is designed to provide the most flexible and reliable solution for producing web evidence for e-discovery, compliance, and records management.

Feature: Benefits of client-side native format web archive

Comprehensive capture: HTML, CSS, JavaScript, documents, links, Flash, video, and form-based content are fully captured.
Native form: Content is stored without change or transformation, to ensure long-term access using standard technology with no added costs.
Standards based: All archive content is stored in ISO 28500 WARC files to ensure longevity and portability as technology changes over time.
Forensically sound: All content is time-stamped, with comprehensive metadata and digests, to prove forensic authenticity and provide reliable evidence.
Always available: The archive provides on-demand access, requiring no backup restores, which means significant ROI compared to backups or CMS.
Quality assured: Experienced and professional archive processes ensure quality of capture and playback, increasing validity for auditors and courts.
Comprehensive playback: All HTML, CSS, JavaScript, documents, links, Flash, video, and form-based content is browsable and searchable in the browser.
Export: Archives can be exported to external web archives, native files, or static PDF documents.
FINRA and SEC 17a-4 compliant: Hanzo warrants compliance with SEC Rule 17a-4 and FINRA Regulatory Notice 10-06, providing significant ROI compared with fines.
Enterprise ready: Extend enterprise repository and e-discovery solutions to incorporate web content with Hanzo Web Archive Connector.

In summary:

- Collect and play back all formats of pages, embeds, forms, navigation, and links
- Re-create the user experience: browse links, play Flash, fill in forms, navigate
- Strong forensic information and metadata provide the most robust evidence possible
- Available for proactive deployment as part of your information management policy, or for reactive deployment in anticipation of, or during, an e-discovery procedure

Conclusion

Archiving websites and social media content for e-discovery, or to comply with other legal and regulatory requirements, is technically challenging and can be costly and time-consuming to implement. The methods for doing so are numerous, but there is one clear path that is straightforward to implement, low risk, and cost-effective: client-side native-format web archiving.

Hanzo's understanding of the challenges, coupled with its unique advanced technology, enables corporations to address their web archiving problems today, with the confidence that should they need to access old content, it will be quick and straightforward.

Hanzo's continuous innovation in web archiving techniques and technology ensures that its customers always have access to the most effective, innovative methods of meeting regulatory and legal requirements.