Big Data Challenges to E-Discovery
Paul Krneta, BMMsoft, Inc.
Perry J. Narancic, Esq., LexAnalytica, PC
© 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda
I. Computing Trends
II. Definition of Big Data
III. Legal and Regulatory Aspects of Big Data
IV. Technology Approaches to Big Data in E-Discovery
V. In the Trenches with Big Data
   A. EDRM Model and Big Data
   B. Case Study:
      Yahoo v. OnlineNIC, No. 08-5698 JF (ND Cal. filed Dec. 19, 2008)
      Verizon v. OnlineNIC, No. 08-2832 (ND Cal. filed June 6, 2008)
      Microsoft v. OnlineNIC, No. 08-4648 (ND Cal. filed Oct. 7, 2008)
VI. Conclusion

I. Computing Trends: the progress
1. CPU, disk, RAM, networks: 2x faster/bigger every 18 months
   - 10x faster/bigger in 5 years, 100x in 10 years, 10,000x in 20 years
2. Price of HW and networks is dropping as fast as speed is growing
   - The same $ buys a 10x faster/bigger system in 5 years, 100x in 10 years
3. Can your application SW take advantage of HW progress?
   - Most SW is not designed with HW progress in mind
   - 10-year-old SW was designed for 100x slower HW
   - 20-year-old SW was designed for 10,000x slower HW
4. SW gets slower faster than HW gets faster

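The 10x, 100x and 10,000x figures follow from compounding the 18-month doubling. A minimal sketch of that arithmetic, assuming steady exponential growth; the function name growth_factor is illustrative only and not from the slides:

```python
# Growth implied by "2x every 18 months" (doubling period taken from the slide).
DOUBLING_MONTHS = 18


def growth_factor(years: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Return how many times faster/bigger hardware is after `years`,
    assuming capacity doubles once every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


if __name__ == "__main__":
    for years in (1.5, 5, 10, 20):
        print(f"{years:>4} years: about {growth_factor(years):,.0f}x")
```

Five years is a little over three doublings, which gives roughly 10x; ten years gives roughly 100x and twenty years roughly 10,000x, matching the rounded figures on the slide.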
I. Computing Trends, cont'd
1. Structured data: 80% of all corporate records (sales, phone records, payment data, stock trades) are structured and stored in relational databases, but due to their small size (typ. < 1 KB) they represent only 20% of enterprise data volume
   - SQL is the main analytic tool for structured data
2. Unstructured data (emails, documents, multimedia) represents over 80% of enterprise data volume (due to the large typical size of 100 KB) but less than 20% of the number of records
   - Text search is the main tool for searching unstructured data
   - Archiving of unstructured data is typ. separate from text search
3. Structured data = huge number of small records
4. Unstructured data = big records, but smaller numbers

II. Big Data = size, speed and diversity
Big Data happens when Moore's law meets data growth
- Data volume doubles every 18 months
Big Data challenges:
- Loading and indexing speed of Big Data has to be very high
- Search speed of Big Data must be very high
- Critical: SQL analysis and text search must be unified
- Storage cost for Big Data can be astronomical if not done right
Examples of Big Data: database records, emails, documents, SMS, video, audio, sensor data, social media and more
Goal: to discover new information and relationships arising from the cross correlation of the entire data set, rather than from isolated silos (i.e. emails or files or transactions)

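To make "SQL analysis and text search must be unified" concrete, here is a minimal sketch in which one query filters structured rows and matches text at the same time. Everything in it is an assumption for illustration: the tables trades and emails, the sample rows, and the function names are hypothetical, and SQLite's LIKE operator stands in for a real full-text index.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding structured rows (trades)
    and unstructured text (email bodies). All data is made up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, trader TEXT, amount REAL);
        CREATE TABLE emails (email_id INTEGER PRIMARY KEY, sender TEXT, body TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?)",
        [
            (1, "alice@example.com", 2_500_000.0),
            (2, "bob@example.com", 1_200.0),
            (3, "carol@example.com", 900_000.0),
        ],
    )
    conn.executemany(
        "INSERT INTO emails VALUES (?, ?, ?)",
        [
            (1, "alice@example.com", "Please keep this trade off the books."),
            (2, "bob@example.com", "Lunch on Friday?"),
            (3, "carol@example.com", "Quarterly report attached."),
        ],
    )
    return conn


def large_trades_with_keyword(conn, keyword, min_amount):
    """Return (trade_id, trader, email_id) for trades of at least `min_amount`
    whose trader also sent an email containing `keyword`."""
    query = """
        SELECT t.trade_id, t.trader, e.email_id
        FROM trades AS t
        JOIN emails AS e ON e.sender = t.trader
        WHERE t.amount >= ?
          AND e.body LIKE '%' || ? || '%'   -- stand-in for a text-search index
        ORDER BY t.trade_id
    """
    return conn.execute(query, (min_amount, keyword)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(large_trades_with_keyword(conn, "off the books", 1_000_000))
```

The point of the sketch is the single query: when structured records and text live in separate silos, the same question needs two searches plus a manual join of the results.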
II. Big Data, a marketing hype?
- Of course, there is Big Data hype, just like there was Web hype, CRM hype, etc.
- But Big Data is nothing new other than a name
- Big Data is here to stay and burden us
See, e.g.:
- Steve Lohr, How Big Data Became So Big, NYT, Aug. 11, 2012. http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html
- Bryant, Katz, Lazowska, Big Data Computing: Creating revolutionary breakthroughs in commerce, science and society, Computing Community Consortium, www.cra.org/ccc/docs/init/big_data.pdf

II. Big Data Applications
Long and diverse list of Big Data uses:
- Civil and Criminal Litigation E-Discovery
- Regulation (e.g. SEC, HIPAA)
- Internal Policing and Dispute Avoidance (internal surveillance to catch problems such as harassment, fraud, etc.)
- National Security (cross correlating addresses, names, phone numbers)
- Audit
- Fraud Detection and Investigation

III. Technology Approaches to Big Data
Multiple approaches to Big Data:
1. Federation: applications search multiple silos of data
   - Relies on the original data repository ("data silo") to search data
   - Slow, expensive, unreliable
2. In-place indexing (similar to Google)
   - Data sources are scanned and indexed to enable search
   - Original data is not managed (= can be modified/deleted, no "HOLD")
3. EDMT
   - EDMT stores Emails, Documents, Multimedia and DB Transactions for data compliance, retention and analysis
   - EDMT runs Text+SQL cross analysis of all data to allow e-discovery, Audit, Fraud Detection, CRM, GRC, BI etc. to run up to 1,000x faster than before
   - Typical EDMT cost: under $1,000 per TB of data

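The "scanned and indexed" step of in-place indexing can be sketched as a small inverted index. This is a generic illustration and not a description of how any particular product works; the function names and the sample documents in the test are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Scan each document once and map every term to the ids of the
    documents that contain it (an inverted index)."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Note that the index only stores document ids. If the original file is later modified or deleted at its source, the index keeps pointing at content that no longer exists, which is the "original data is not managed" weakness listed above.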
III. 1 PB of data in EDMT: 2007 vs. 2012
- 1 PB (1,030 TB) of data
- 6 trillion records loaded + indexed
- Loading + indexing speed: 285 B records per day, 35 TB/day
- Fresh data loaded in < 2 sec
- Search time: < 0.5 sec

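A rough back-of-the-envelope reading of these figures, as a sketch only. It assumes the quoted loading rate is sustained for the whole load and applies the "under $1,000 per TB" figure from the technology-approaches slide as a flat per-TB ceiling; neither assumption is stated in the slides, and the function names are illustrative.

```python
# Figures quoted on the slides.
TOTAL_TB = 1_030                 # "1 PB (1,030 TB) of data"
TOTAL_RECORDS = 6e12             # "6 Trillion records"
RECORDS_PER_DAY = 285e9          # "285 B records per day"
MAX_COST_PER_TB_USD = 1_000      # "under $1,000 per TB of data"


def days_to_load(total_records: float, records_per_day: float) -> float:
    """Days needed to load and index all records at a constant rate."""
    return total_records / records_per_day


def cost_ceiling_usd(total_tb: float, max_cost_per_tb: float) -> float:
    """Upper bound on cost if every TB costs at most `max_cost_per_tb`."""
    return total_tb * max_cost_per_tb


if __name__ == "__main__":
    print(f"Load time: about {days_to_load(TOTAL_RECORDS, RECORDS_PER_DAY):.0f} days")
    print(f"Cost ceiling: under ${cost_ceiling_usd(TOTAL_TB, MAX_COST_PER_TB_USD):,.0f}")
```

Under these assumptions, loading 6 trillion records takes about three weeks, and the cost for 1,030 TB stays under roughly $1.03 million.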
III. EDMT smaller/bigger than 1 PB
EDMT Solution Models and Specifications
Slide labels: High, Mid, Entry, OnLine

                                                 # emails & files (100 KB each)
    Model  Description  # cores  Disk size (TB)  store + index  index only  DB rows (150 bytes each)
7   16K    96 racks     15,360   41,472          180 B          1,800 B     640 Trillion
6   4K     24 racks     3,840    10,368          48 B           480 B       160 Trillion
5   1K     6 racks      960      2,592           12 B           120 B       42 Trillion
4   4XL    Full rack    160      432             2 B            20 B        7 Trillion
3   PB     1/2 rack     80       288             1.6 B          16 B        6 Trillion
2   XL     1/3 rack     40       144             600 M          6 B         2 Trillion
1   L      1/4 rack     24       72              300 M          3 B         1 Trillion
    M      2 RU         12       36              150 M          1.5 B       500 B
    S      2 RU         6        36              150 M          1.5 B       500 B
    XS     2 RU         4        36              150 M          1.5 B       500 B

IV. In the Trenches with Big Data
Big Data Challenges in E-Discovery:
- Volume
- Heterogeneity
- Distributed

IV. In the Trenches with Big Data, cont'd
The EDRM Model and Big Data
- Information Management: the problem of silo searching. The need for speed, completeness, comprehensiveness.
- Identification: the problems with the traditional custodian and keyword approaches. National Day Laborer Org. v. US, 2012 WL 2878130 (SDNY)
- Preservation and Collection: problems with custodian-based preservation. Sanctions order in Apple v. Samsung

IV. In the Trenches with Big Data, cont'd
Case Study: Cross Analysis of Structured and Unstructured Data in the OnlineNIC Trilogy
Facts:
- Yahoo, Microsoft and Verizon sued OnlineNIC, one of the largest domain name registrars in the world, for cybersquatting.
- The plaintiffs alleged that the defendant was using aliases to register infringing domains.
- The defense used BMMsoft technology to collect and cross correlate the registration database and emails to show that the registrants were not affiliated with the defendant.
- The Yahoo and Verizon cases settled; the Microsoft case was voluntarily dismissed.

Conclusion
1. Big Data is real.
2. Big Data poses unique issues for E-Discovery.
3. Technology solutions need to be big, and unify structured and unstructured data into a single repository.

THANK YOU
Paul Krneta, paulk@bmmsoft.com
Perry J Narancic, perryn@bmmsoft.com

BACKUP SLIDES
EDMT extracts benefits from Big Data
EDMT advantages:
- Scalability & Performance
- Enterprise Ready
- Fast deployment, Easy maintenance
Use cases:
- FLAC (Fraud, Legal, Audit, Compliance)
- BI & DW: Better Customer Perspective
- IT: Operational Efficiency
[Diagram labels: BI (SQL), FLAC (Text), EDMT, Big Data 2.0]