1 AMERICAN PUBLIC UNIVERSITY SYSTEM Charles Town, West Virginia CLOUD COMPUTING AND THE IMPLICATIONS FOR E-DISCOVERY A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in INFORMATION TECHNOLOGY - DIGITAL FORENSICS by Ann Hearn Ziluck Department Approval Date: January 30, 2012 The author hereby grants the American Public University System the right to display these contents for educational purposes. The author assumes total responsibility for meeting the requirements set by United States Copyright Law for the inclusion of any materials that are not the author s creation or in the public domain.
2 ii COPYRIGHT PAGE Copyright 2012 by Ann Hearn Ziluck All rights reserved.
3 iii DECLARATION I hereby declare that the work has been done by myself and no portion of the work contained in this Thesis has been submitted in support of any application for any other degree or qualification on this or any other university or institution of learning. Ann Hearn Ziluck
4 iv ACKNOWLEDGEMENTS I wish to thank my APUS graduate professors Dr. Karen Paullet, Dr. Diane Barrett, and Professor Robert Marlett who have encouraged me to strive for excellence and to reach for my academic goals. If not for their constant communications and support throughout my undergraduate degree, and the exciting discussions about network security, digital forensics and project management, I would not have immediately pursued a graduate degree. These dynamic professors continually 'threw down the academic gauntlet' forcing me to respond to their challenges. It seemed impossible to end the lively dialog with these three individuals after an undergraduate degree. Knowing they were all teaching in the newly launched Masters program in Information Technology Digital Forensics made it a very simple decision to continue my education in this exciting program. I would also like to thank Professors Sammy Abaza, and Murali Ramaswamy for serving as my technical mentors throughout my coursework. Their tireless support for a myriad of questions well outside the general course outlines was greatly appreciated. Lastly, this document could not have been crafted without the continual support, patience and good humor of Dr. Novadean Watson-Stone. While I endeavored to write about the emerging technologies of cloud computing and e-discovery, she provided continual feedback, moral and academic support, editing guidance, and went to great lengths to procure requested sources that were in many cases only a few weeks old. Her guidance, enthusiasm and confidence in my abilities were invaluable.
5 v DEDICATION I dedicate this thesis to my son Michael Ziluck, my boyfriend Thomas Mondak, my parents John W. and Joan R. Hearn, my brother John W. Hearn, Jr. and my sister Barbara Allen- Lyall. Without their patience, understanding and support, academic encouragement, and most of all love, the completion of this work would not have been possible.
6 vi ABSTRACT CLOUD COMPUTING AND THE IMPLICATIONS FOR E-DISCOVERY by Ann Hearn Ziluck American Public University System, January XX, 2012 Charles Town, West Virginia Dr. Novadean Watson-Stone, Thesis Professor E-Discovery poses many challenges to legal and information technology teams who must quickly produce information required for litigation. The extraordinary volumes, types, electronic and manual formats, and repositories of data produced by even small organizations must be quickly searched and analyzed as part of the E-Discovery process. Data created and stored in cloud computing, social media and mobile computing environments intensify the typical complexities of e-discovery since traditional advanced search analytics, digital forensics and E- Discovery techniques are more difficult or in some cases impossible with these technologies. Cloud computing and social media environments often prevent extraction of metadata required for defensible production of information. Digital forensics is impossible in some cloud computing environments, preventing experts from obtaining defensible and reproducible data sets. Legal and information technology teams must carefully consider the use of cloud computing and social media environments and ensure their corporate constituents that they are empowered to perform structured, effective, and scientific e-discovery practices for regulatory and legal compliance.
7 vii Table of Contents COPYRIGHT PAGE...ii DECLARATION... iii ACKNOWLEDGEMENTS... iv DEDICATION... v ABSTRACT... vi Introduction... 1 Problem Statement... 2 Purpose... 3 Significance of the Study... 3 Literature Review... 4 E-Discovery... 4 E-Discovery in the Legal and Technical Communities... 6 E-Discovery Searching and Web E-Discovery and Social Media Challenges in E-Discovery Types of Cloud Environments Emerging Challenges from Cloud Computing Digital Forensics in the Cloud and Applications to E-Discovery Cloud Technology and E-Discovery Methodology Subjects and Setting Data Collection Technique Statistical Analysis Limitations of the Study Review of the Problem or Research Question Summary of Literature Review Recommendations based on the Literature Review Only References... 48
8 Cloud Computing - The Impact on E-Discovery Introduction E-discovery is an emerging and challenging field that combines the efforts of both legal and technology professionals in the production of electronic information required for litigation. Litigants are required to provide electronic information in a commonly agreed upon format often under extraordinary time constraints dictated by the courts. Organizations are required to search for and preserve documents and data, and must take reasonable steps to do so. This includes notifying all parties that might have the ability to destroy, alter, or otherwise compromise relevant information, and monitoring individual data custodians. The task of searching for and preserving the integrity of data is monumental, even when considering the volume of information created by a small organization that must search through all business transactions, paper documents, backup tapes and historical archives, foreign language documents, and any Web 2.0 information (Schuler, Peterson, & Vincze, 2009). Cloud computing is an emerging paradigm that provides organizations with remote, scalable, on-demand network computing resources that require only minimal management effort. It also offers substantial cost savings over traditional datacenter network operations. However, Mell and Grance (2011) from the NIST caution readers that "security, interoperability, and portability" (p. 3) are some of the major considerations or barriers to widespread adoption of these methodologies. Brodkin (2008) from the Gartner Group agrees, and adds, "Data integrity, recovery, privacy and regulatory compliance are key issues to consider" (p. 1). Brodkin (2008) also makes a strong statement that "Investigating inappropriate or illegal activity may be impossible in cloud computing" (p. 2). Cloud hosting providers may not be willing or able to produce network logs due to shared tenancy on internationally dispersed and constantly changing
9 2 servers. Brodkin (2008) continues by quoting Gartner researchers Heiser and Nicolett (2008) who commented that if the cloud vendor cannot not contractually commit their support for specific investigations and e-discovery practices, and demonstrate previous successful instances of discovery compliance, then "the only safe assumption is that investigation and discovery requests will be impossible" (Brodkin, 2008, p. 1; Heiser & Nicholett, 2008, p. 4). Rule 26 of the Federal Rules of Civil Procedure (FRCP) according to Cornell University Law School Staff (n.d.) are explicit with regard to the timely production of documents and information stored in electronic form owned by the disclosing party. Initial production and disclosure of information under the FRCP is generally required within 30 days of the Rule 26 conference between the two litigants. The literature suggests that cloud computing intensifies the common difficulties involved in e-discovery projects, taking them to an exponentially greater level, and may prevent discoverability entirely. Problem Statement Cloud computing environments can be used to host entire enterprise systems to empower a mobile workforce with the added benefits of considerably reducing costs and resources required to maintain a networked computing infrastructure. Cloud computing environments are also linked with social media and Web 2.0 data that plays an important part in the scope of discoverable enterprise data. Cloud computing seems to exacerbate the complexities of the practice of e-discovery making advanced search analytics, digital forensics and e-discovery far more difficult or impossible. Further research could illustrate these complexities and help to educate organizations when making cloud computing choices, and change the contractual assurances that cloud vendors must ultimately provide to empower their customers to perform e- discovery for regulatory and legal compliance.
10 3 Purpose The purpose of this research is to illustrate the complexities or impossibilities of e- discovery in cloud computing environments and determine if this knowledge can empower corporate customers to require discovery and forensic compliance by cloud vendors. Significance of the Study Cloud computing is an emerging paradigm that is experiencing widespread exposure and enormous marketing efforts in the information technology communities. Cloud vendors are working diligently to position themselves to take advantage of the potential market share of this technology. Literature in support of this paradigm promotes the potential for substantial savings in terms of networks and personnel, increased security, scalability, and even reduction of an organizations carbon footprint. What are not discussed in detail are the limitations or impossibility of e-discovery and digital forensic acquisition and investigation of enterprise data in the cloud environment. This paper will include literature reviews of expert commentary on the dangers of cloud computing with regard to e-discovery, and the need for verifiable proof that a cloud-hosting vendor will provide access to the network data required for digital forensics and compliance with e-discovery requests.
11 4 Literature Review E-Discovery Attorneys have performed searches for relevant documents for generations, combing through paper records in file cabinets, ledgers, logs and bank boxes. In many instances, the team producing the documents has "dumped" this information on the opposing party in an attempt to shift the burden of searching through voluminous records. Prior to the widespread use of computer systems, discovery of paper documents was typically possible to complete in the required time and considered an effective and virtually perfect way of producing information for litigation. Searching has always been a monumental task although modern computer systems have compounded these difficulties in staggering ways. Organizations create and store tens of thousands of times as many records as they did just two decades ago and this information is often stored in many places and varying formats. Pass (2011) notes that 487 billion gigabytes of data were created globally in 2008 and that this astounding number will double at least every 18 months. The practice of E-Discovery encompasses not only searching for relevant information using a wide variety of electronic search methods, but also the automation of these processes, manual review of electronically discovered data, as well as accessibility and retention policies and practices (Pass, 2011). Electronically stored information (ESI) is information that is created and stored in digital form and accessible using computer hardware and software. ESI normally includes metadata, which is data about data. When documents, images and many other types of electronic data are created, information is added either to the digital file or in an attached/associated file such as the name, type, size of the file, location and ownership. Digital cameras often record the date and time of image creation and word processing documents contain metadata that often includes the
12 5 author, time and date of creation and the last date modified. Digital audio files contain the album name, song title, date recorded, and artist. Metadata is created and updated automatically by the computer application and may be created by the end user. E-Discovery experts must locate documents and files relevant to their cases, but also the associated metadata that contains valuable evidence about each file (Schuler, Peterson & Vincze, 2009) The practice of digital forensics is also an important aspect of the E-Discovery process, since all information produced for litigation must also be defensible. Defensibility includes the people, search methodologies, collection methods, documentation, and chain of custody involved in the production of electronic data. Opposing counsel may question the methods of searching, producing, and chain of custody of files and documents. E-Discovery experts must therefore adhere to the sound and scientific practices of digital forensics in order to maintain admissibility of information in a court of law (Schuler et al., 2009). The Federal Rules of Civil Procedure (FRCP) that apply to all cases filed in federal court were amended on December 1, 2006 and included details regarding how electronically stored information (ESI) must be handled. Two notorious court cases, Zubulake v. UBS Warburg and Coleman v. Morgan Stanley, are considered catalysts for the FRCP amendment. Judges in these cases determined that data integrity must be forensically sound and that e-discovery responsiveness and diligent examination of evidence are requirements (Schuler et al., 2009). The FRCP amendments included specific guidelines for the early attention and agreement on the types and formats for production of electronically stored data (ESI) as well as the systemized processes required to locate data. Parties are tasked to agree on the search keywords that will produce relevant information for the case, as well as other details including the common format for delivered data (i.e. paper files,.pdf, digital files, or forensic images). Pretrial and discovery
13 6 planning conferences are also requirements for opposing parties as well as strict search and preservation rules once litigation has begun. The FRCP amendments also include the word reasonable with regard to the efforts required for searching. This somewhat ambiguous term is the cause of much discussion in courts and the legal community. It also highlights the need to examine organizational data retention and destruction policies, since it is reasonable to search only data that exists as part of a regular retention policy. Organizations that historically retained backup tapes from old systems no longer in use for many years may reconsider these policies with regard to potential E-Discovery requirements. If an organization only retains backup information for a specific period, it is not reasonable to expect them to search and produce data outside of those retention parameters (Schuler et al., 2009). The Coleman v. Morgan Stanley case in 2004 is an example where sound retention policies may have been critically important. A Morgan Stanley IT executive testified that all relevant records had been provided to the court, when in fact 1600 unsearched backup tapes were discovered that might have contained important information. The court responded by awarding almost $1.8 billion dollars in damages to Coleman and determined that Morgan Stanley had committed fraud. The omission of the old backup tapes may have been an honest oversight or an attempt to conceal digital evidence, but it certainly illustrates the importance of proactively managing data and creating logical retention and destruction policies (Schuler et al., 2009). E-Discovery in the Legal and Technical Communities E-Discovery combines the efforts of the legal, technical, and many other departments in a typical organization. There are significant costs and penalties if litigants violate the Federal Rules of Civil Procedures and organizations are becoming keenly aware of the importance of implementing effective discovery processes. It is important for organizations to be prepared to
14 7 respond to e-discovery requests in a manner that is accurate, reliable, cost-effective, and timely. Technology is now an important part of the E-Discovery process and can be used to reduce the time and effort required to provide documentation for review by the producing organization's legal counsel. A wide range of software systems are commercially available that not only facilitate the search processes, but also provide integrated data retention methodologies and forensically sound capture and segregation features designed specifically for E-Discovery (Schuler et al., 2009). Organizations create, receive, and store information in a wide range of formats and locations. Data relevant to an E-Discovery case may include paper records, , digital telephone, audio and video files, engineering CAD drawings, social media information, and information from database systems. Information may also be stored in countless places and on many types of digital media. Many organizations own and manage their own data centers and regularly backup their data to removable media or offsite storage systems. Relevant data may also be stored in cloud environments, vendor-hosted environments, or on servers owned by social media organizations. Staff in various offices may have company data stored on their desktop computers, personally owned computers, removable media such as flash drives and CD's and even on their own Smartphone's. Information technology staff must maintain a clear and detailed map of where all of their data is stored in the event that these widely distributed repositories must be searched for litigation. Organizations that create record retention policies must consider and map all possible data repositories since the scheduled destruction of information can provide enormous litigation cost savings if properly managed. Companies should avoid the necessity to explain to a court why information outside the retention period exists in some instances, and not others. Some
15 8 organizations are subject to specific retention periods for legal, contractual or other important reasons that must be factored into the overall plan. Mapping data locations and identification of data custodians can be an enormous task, however, if an e-discovery request is received, it is far better to have a documented plan and formal processes in place. The legal, technical, human resource and other departments may form a discovery response team to handle these tasks, with specialists who are responsible for various parts of the process. Litigants who can demonstrate structured policies with clearly defined retention and destruction periods are able to defend these policies in a court of law. If opposing litigants request information that is outside of the normal retention period, the organization cannot reasonably produce this information since it no longer exists (Schuler et al., 2009). The combination of efforts between legal and information technology departments is necessary to determine the E-Discovery readiness for the organization. The legal staff understand the litigation requirements, and the information technology staff know where and how information is stored. Working together these groups can determine the systems that are available or must be implemented to support discovery, and ways that these systems can be leveraged to reduce costs, risk and speed of data production for litigation. It may also be important to investigate available external resources that may help to facilitate and respond defensibly to an E-Discovery request. E-Discovery Searching and Web 2.0 The Federal Rules of Civil Procedure (FRCP) include specific rules regarding pre-trial activities. Rule 16(b) outlines the necessity for a Pretrial Conference where litigants discuss the disclosure provisions and any claims either party may have regarding privileged relevant information. Rule 26(f) of the FRCP mandates discussions regarding information preservation,
16 9 discovery of information and a mutually acceptable form of data production. Some litigants require paper copies of all information including relevant metadata for each item. Others may request.pdf formats, or even forensic images of the information. These early conferences and discussions also include the establishment of a list of search terms and date ranges that will be used in the search and discovery processes to locate relevant information. The most critical phase of the E-Discovery process according to Schuler, Peterson and Vincze (2009) is the effective search for data and identification since the requirement to protect, preserve and present relevant and forensically sound data in a court of law may be necessary for any organization. The demonstration that a litigant has made a "good-faith" or reasonable effort to locate and preserve data required for review is a somewhat gray area of the law but Schuler et al. (2009) note that the courts take immediate action if a litigant does not meet their obligations to preserve data. Litigants must provide relevant data including associated metadata in order to present the complete documentation requested. Metadata is both file and system data, which in the example of an would be the header information that contains the sender and recipient addresses, the IP address and other identifications pertinent to the transmission, plus the server logs that show receipt of the transmission (Schuler et al., 2009). Some organizations manage sophisticated fully integrated content and e-discovery management systems used to identify, preserve and manage data and documents used in litigation. These software systems provide advanced tools that allow legal and information technology professional's access to comprehensive search tools with the ability to mark discovered documents for retention. The search capabilities of some of the e-discovery software market leaders are capable of examining data from social networking sites, company database
17 10 and customer relationship management (CRM) applications and also electronic backup archives of and other documents (Schuler et al., 2009). James (2011) documents a webcast where E-Discovery professionals including Craig Ball, attorney and forensic technologist, discuss data mapping as the starting point for effective searching during the e-discovery process. According to James (2011) data mapping is the creation of a data inventory that is a critical element in the process, and is a guide of "where to look" for information that may be relevant to a case including paper, electronic data and documents, and voice mail messages. This process should involve checklists, in-person interviews with key custodians and always remembering to consider paper documents, home computers, and other places where corporate data may reside. Data custodians may forget data and documents that they produce in the course of their work so James (2011) notes that Craig highly recommends asking open ended questions and looking around the physical environment (James, 2011). Searching for information often includes a number of different approaches including basic search techniques of custodian self-collection, and hard drive imaging. Electronic searches include keyword, metadata and Boolean searches. More advanced approaches include analytic technology that allows clustering and categorization, as well as concept and contextual searches. Organizations may have relevant data in their legacy relational database systems, structured and non-structured data warehouses, Web 2.0 sites such as blogs and social media websites and on backup tapes and offsite data repositories used for disaster recovery. Many data warehouse systems allow sophisticated searching for data based on concepts and contextual information and can be very useful for locating information. An unstructured data analytical engine is capable of locating abstract references, where a search for BAS might locate documents that the acronym
18 11 BAS is related to including Basic Allowance for Subsistence', Broadband Access Server, Basic programming language and/or Bulgarian Academy of Sciences, or things such as references to botany in fifteen languages (Inmon et al., 2011). Schuler et al. (2009) caution readers that keyword, date and custodian searching may not be sufficient to provide and preserve relevant ESI since the many different search methods can produce widely varied results. Each software application used in an E-Discovery case should be tested thoroughly for accuracy. Some software systems handle things such as abbreviations and polysemy more effectively than others do. Natural language synonymy also presents problems as do variations of words and names such as "Bob" and "Robert" and the possibility of foreign languages. Schuler et al. (2009) also comment that traditional keyword search engines are unable to discover data on alternate data sources such as video and voice recordings and may result in millions of irrelevant matches (Schuler et al., 2009). Akers, Mason and Mansmann (2011) discuss the concept of analytic searching called "Latent Semantic Analysis" (p. 41). This technique includes production of documents based on both their content and keyword matches and electronically resembles the methods of human deduction from natural language, and which provides a weighting system for the results. The ideal E-Discovery search system according to Akers et al. (2011) would include efficient search capabilities able to analyze the full content of digital files and their metadata with comprehensive capabilities for weighting, analysis, validation and extraction of data. A virtual index stores the content and metadata separately and can grow exponentially while still allowing autonomous searches on either metadata or content. Sophisticated systems must be able to perform virtual indexing to handle the large data sets in use today that may be many terabytes in size (Akers et al., 2011).
19 12 Although some sophisticated software applications are capable of searching Web 2.0 content, there are difficulties producing the associated metadata from these websites. Web 2.0 interactive collaborative websites such as blogs, social media websites and wiki's will not or cannot provide the complete metadata. However, this information is still being used in American courts, typically in printed or web archive formats, yet does not contain the actual associated metadata that is normally required for defensible data production (Meyer, 2009). There are many different approaches to data identification and searching for relevant data during an e-discovery event. Some organizations employ outside legal and technical resources to assist with what can be an arduous process; others maintain a comprehensive mix of strict data policies and both content management and e-discovery software that is continuously updated and maintained. Regardless of a formal structured policy-based approach or one that is specific to a particular discovery case, the processes must be structured, well documented, and detailed enough to ensure that opposing council cannot claim there is a lack of participation and timely production of defensible data. E-discovery teams may place holds on many different types of data including forensic images, physical documents, and data files containing financial figures, documents, voice messages and even content from company social networking or web sites. The manual approach that some organizations take may seem cumbersome or even prehistoric to others who spend millions of dollars annually preparing for the eventual litigation request, but although the methods may be vastly different, the result of producing defensible, repeatable, and timely data is a necessity (Schuler et al., 2009).
20 13 E-Discovery and Social Media The volume of data that is created on social media websites has reached astronomical proportions and is expected to increase substantially in the coming years. Frazier (2011) reports that Facebook users post over 30 billion individual pieces of digital content each month including comments, links, images, and videos. In 2010, over 25 billion tweets were posted on the popular Twitter website and that same year over 6 trillion text messages were exchanged. The lines between personal and organizational use of social and mobile media are typically very clear; however, many company-owned computers and mobile devices are being used by employees for non-company activities (Frazier, 2011). Some organizations consider regulating employee use of social media yet this does little to stop their staff from engaging in social media discussions and posting updates to these popular websites through the use of their own Smartphone's and computers. Frazier (2011) comments that social and mobile media has been over-dramatized to some degree with regard to e- discovery and that "E-Discovery regulations do not pose an immediate and large threat to large corporations" (p. 1). Frazier (2011) notes that courts and E-Discovery experts distinguish between corporate and individual social media accounts allowing ample time for legal teams to consider a proactive and pragmatic approach to collecting this information. Jaeger (2011) disagrees and provides statistics from an enterprise strategy group that 58 percent of organizational respondents already include social media applications in their E- Discovery collections including data from Facebook, Twitter and LinkedIn. Jaeger (2011) also reports that this same study reveals that cloud-based data "will parallel the expected rise in social media discovery" (p. 1).