Big Data Challenges to E-Discovery




Paul Krneta, BMMsoft, Inc.
Perry J. Narancic, Esq., LexAnalytica, PC
2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda
I. Computing Trends
II. Definition of Big Data
III. Legal and Regulatory Aspects of Big Data
IV. Technology Approaches to Big Data in E-Discovery
V. In the Trenches with Big Data
   A. EDRM Model and Big Data
   B. Case Study: Yahoo v. OnlineNIC, No. 08-5698 JF (N.D. Cal. filed Dec. 19, 2008); Verizon v. OnlineNIC, No. 08-2832 (N.D. Cal. filed June 6, 2008); Microsoft v. OnlineNIC, No. 08-4648 (N.D. Cal. filed Oct. 7, 2008)
VI. Conclusion

I. Computing Trends: the progress
1. CPU, disk, RAM, networks: 2x faster/bigger in 18 months
   - 10x faster/bigger in 5 years, 100x in 10 years, 10,000x in 20 years
2. The price of HW and networks is dropping as fast as speed is growing
   - The same $ buys a 10x faster/bigger system in 5 years, 100x in 10 years
3. Can your application SW take advantage of HW progress?
   - Most SW is not designed with HW progress in mind
   - 10-year-old SW was designed for 100x slower HW
   - 20-year-old SW was designed for 10,000x slower HW
4. SW gets slower faster than HW gets faster
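The multipliers on this slide follow from compounding a doubling every 18 months. A minimal sketch of that arithmetic, assuming a fixed doubling period (the function name `growth_factor` is illustrative, not from the presentation):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Capacity multiplier after `years`, assuming one doubling every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Horizons quoted on the slide: 18 months, 5, 10 and 20 years.
    for years in (1.5, 5, 10, 20):
        print(f"{years:>4} years: about {growth_factor(years):,.0f}x")
```

Under this assumption the results land close to the slide's rounded figures of 10x, 100x and 10,000x.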

I. Computing Trends, cont'd
1. Structured data: 80% of all corporate records (sales, phone records, payment data, stock trades) are structured and stored in relational databases, but due to their small size (typ. < 1 KB) they represent only 20% of enterprise data volume
   - SQL is the main analytic tool for structured data
2. Unstructured data (emails, documents, multimedia) represents over 80% of enterprise data volume (due to the large size of each item, typ. 100 KB) but less than 20% of the number of records
   - Text search is the main tool for searching unstructured data
   - Archiving of unstructured data is typ. separate from text search
3. Structured data = huge number of small records
4. Unstructured data = big records, but smaller numbers

II. Big Data = size, speed and diversity
- Big Data happens when Moore's law meets data growth
  - Data volume doubles every 18 months
- Big Data challenges:
  - Loading and indexing speed of Big Data has to be very high
  - Search speed of Big Data must be very high
  - Critical: SQL analysis and text search must be unified
  - Storage cost for Big Data can be astronomical if not done right
- Examples of Big Data: database records, emails, documents, SMS, video, audio, sensor data, social media and more
- The goal: to discover new information and relationships arising from the cross-correlation of the entire data set, rather than of isolated silos (i.e. emails or files or transactions)

II. Big Data, a marketing hype?
- Of course, there is Big Data hype, just like there was Web hype, CRM hype etc.
- But Big Data is nothing new other than a name
- Big Data is here to stay and burden us
- See e.g.:
  - Steve Lohr, How Big Data Became So Big, NYT, Aug. 11, 2012. http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html
  - Bryant, Katz, Lazowska, Big Data Computing: Creating revolutionary breakthroughs in commerce, science and society, Computing Community Consortium, www.cra.org/ccc/docs/init/big_data.pdf

II. Big Data Applications
Long and diverse list of Big Data uses:
- Civil and criminal litigation
- E-Discovery
- Regulation (e.g. SEC, HIPAA)
- Internal policing and dispute avoidance (internal surveillance to catch problems like harassment, fraud, etc.)
- National security (cross-correlating addresses, names, phone numbers)
- Audit
- Fraud detection and investigation

III. Technology Approaches to Big Data
Multiple approaches to Big Data:
- Federation: applications search multiple silos of data
  - Relies on the original data repository ("data silo") to search data
  - Slow, expensive, unreliable
- In-place indexing (similar to Google)
  - Data sources are scanned and indexed to enable search
  - Original data is not managed (= can be modified/deleted, no "HOLD")
- EDMT:
  - EDMT stores Emails, Documents, Multimedia and DB Transactions for data compliance, retention and analysis
  - EDMT runs Text+SQL cross-analysis of all data to allow eDiscovery, Audit, Fraud Detection, CRM, GRC, BI etc. to run up to 1,000x faster than before
  - Typical EDMT cost: under $1,000 per TB of data
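To make the idea of Text+SQL cross-analysis concrete, here is a minimal sketch in which structured rows and message text sit in one repository and are queried together. It is not EDMT's implementation: it uses Python's built-in sqlite3 module, the table names, column names and sample rows are hypothetical, and a plain `LIKE` substring match stands in for a real full-text index.

```python
import sqlite3


def build_repository() -> sqlite3.Connection:
    """Create an in-memory store holding both structured rows and email text.

    All tables and sample rows are hypothetical illustration data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transactions (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
        CREATE TABLE emails (id INTEGER PRIMARY KEY, sender TEXT, body TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO transactions (account, amount) VALUES (?, ?)",
        [
            ("alice@example.com", 9500.0),
            ("bob@example.com", 120.0),
            ("carol@example.com", 15000.0),
        ],
    )
    conn.executemany(
        "INSERT INTO emails (sender, body) VALUES (?, ?)",
        [
            ("alice@example.com", "Please keep this transfer off the books."),
            ("bob@example.com", "Lunch on Friday?"),
            ("carol@example.com", "Quarterly report attached."),
        ],
    )
    return conn


def cross_search(conn: sqlite3.Connection, keyword: str, min_amount: float) -> list:
    """One query combining a SQL filter (amount) with a text match (keyword)."""
    return conn.execute(
        """
        SELECT t.account, t.amount, e.body
        FROM transactions AS t
        JOIN emails AS e ON e.sender = t.account
        WHERE t.amount >= ? AND e.body LIKE '%' || ? || '%'
        ORDER BY t.amount DESC
        """,
        (min_amount, keyword),
    ).fetchall()


if __name__ == "__main__":
    for row in cross_search(build_repository(), "off the books", 5000.0):
        print(row)
```

With federation, the same question would need two separate searches against two silos plus a manual join of the results; here the join happens inside one query.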

III. 1 PB of data in EDMT: 2007 vs. 2012
- 1 PB (1,030 TB) of data
- 6 trillion records loaded + indexed
- Loading + indexing speed: 285 B records per day (35 TB/day)
- Fresh data loaded in < 2 sec
- Search time: < 0.5 sec
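The figures on this slide can be turned into per-second rates and rough load times with simple arithmetic. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and reads "B" as billion. The slide does not say how the numbers were measured, so treat the results as a rough consistency check rather than a benchmark.

```python
TB = 10**12  # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400

# Figures quoted on the slide.
total_bytes = 1_030 * TB          # 1 PB (1,030 TB) of data
total_records = 6 * 10**12        # 6 trillion records
records_per_day = 285 * 10**9     # 285 B records per day
bytes_per_day = 35 * TB           # 35 TB per day

# Derived quantities.
avg_record_bytes = total_bytes / total_records
records_per_second = records_per_day / SECONDS_PER_DAY
days_by_records = total_records / records_per_day
days_by_volume = total_bytes / bytes_per_day

if __name__ == "__main__":
    print(f"average record size: {avg_record_bytes:.0f} bytes")
    print(f"sustained load rate: {records_per_second:,.0f} records/s")
    print(f"time to load by record count: {days_by_records:.1f} days")
    print(f"time to load by data volume: {days_by_volume:.1f} days")
```

The two load-time estimates differ because the record rate and the volume rate imply slightly different average record sizes. The slide does not reconcile them.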

III. EDMT smaller/bigger than 1 PB
EDMT Solution Models and Specifications (OnLine; tiers: High, Mid, Entry; B = billion, M = million)

     Model  Description   # cores  Disk size (TB)  Emails & files (100 KB each) DB rows (150 bytes)
                                                   store + index  index only
7    16K    [96 racks]    15,360   41,472          180 B          1,800 B       640 Trillion
6    4K     [24 racks]    3,840    10,368          48 B           480 B         160 Trillion
5    1K     [6 racks]     960      2,592           12 B           120 B         42 Trillion
4    4XL    [Full rack]   160      432             2 B            20 B          7 Trillion
3    PB     [1/2 rack]    80       288             1.6 B          16 B          6 Trillion
2    XL     [1/3 rack]    40       144             600 M          6 B           2 Trillion
1    L      [1/4 rack]    24       72              300 M          3 B           1 Trillion
     M      [2 RU]        12       36              150 M          1.5 B         500 B
     S      [2 RU]        6        36              150 M          1.5 B         500 B
     XS     [2 RU]        4        36              150 M          1.5 B         500 B

IV. In the Trenches with Big Data
Big Data Challenges in E-Discovery:
- Volume
- Heterogeneity
- Distributed

IV. In the Trenches with Big Data, cont'd
The EDRM Model and Big Data
- Information Management: the problem of silo searching; the need for speed, completeness, comprehensiveness.
- Identification: the problems with the traditional custodian and keyword approaches. National Day Laborer Org. v. US, 2012 WL 2878130 (S.D.N.Y.)
- Preservation and Collection: problems with custodian-based preservation. Sanctions order in Apple v. Samsung.

IV. In the Trenches with Big Data, cont'd
Case Study: Cross-Analysis of Structured and Unstructured Data in the OnlineNIC Trilogy
Facts:
- Yahoo, Microsoft and Verizon sued OnlineNIC, one of the largest domain name registrars in the world, for cybersquatting.
- The plaintiffs alleged that the defendant was using aliases to register infringing domains.
- The defense used BMMsoft technology to collect and cross-correlate the registration database and emails to show that the registrants were not affiliated with the defendant.
- The Yahoo and Verizon cases settled; the Microsoft case was voluntarily dismissed.

Conclusion
1. Big Data is real.
2. Big Data poses unique issues for E-Discovery.
3. Technology solutions need to be big, and unify structured and unstructured data into a single repository.

THANK YOU
Paul Krneta, paulk@bmmsoft.com
Perry J. Narancic, perryn@bmmsoft.com

BACKUP SLIDES

EDMT extracts benefits from Big Data
EDMT advantages:
- Scalability & Performance
- Enterprise Ready
- Fast deployment, Easy maintenance
Use cases:
- FLAC (Fraud, Legal, Audit, Compliance)
- BI & DW: Better Customer Perspective
- IT: Operational Efficiency
[Diagram labels: BI (SQL), FLAC (Text), EDMT Big Data 2.0]