1 Big Data Analytics Best Practices
   Marshall Presser, Federal Field CTO, Greenplum
2 Big Data Makes the Mainstream
3 WHAT DOES IT TAKE?
4 1. New Applications: MADlib
5 2. New Skill Sets: Data Science
6 3. The Right Platforms: Structured and Unstructured Data Clusters
7 The Goal: The Predictive Enterprise
   Predictive Enterprise: Data-Driven Decisions
   Deliver maximum business value from all the available data
   Predict outcomes using advanced analytics
   Leverage data science to gain deep insight about the business
   Turn insight into action with new applications
8 Federal Agency Requirements for Big Data Analytics
   Intelligence Community: Counter-Terrorism; Counter-Intelligence; Cyber-Security; Intelligence Analysis
   Department of Defense: Data to Decisions (reduce the cycle time and manpower requirements); Cyber Science & Technology (efficient and effective cyber capabilities); Counter Weapons of Mass Destruction (secure, monitor, track and eliminate weapons of mass destruction)
   Financials: Fraud Detection; Insider Trading; Risk Analytics
   Homeland Security: Identity Verification; Transportation Security; Border Security; Immigration Control; Investigations on Massive Amounts of Data; Maritime Domain Awareness
   Department of Justice: Counter-Terrorism & Foreign Intelligence Defense; Organized Crime Investigations; Drug Enforcement & Illicit Drug Traffic Reduction
   Healthcare & Citizen Benefits: Fraud, Waste & Abuse Detection; Accurate Patient Identification & Treatment; Accurate Benefit Distribution and Monitoring; Healthcare Exchanges
9 Big Data Initiatives in the US Federal Government
   Economic forecasting: mortgage foreclosures
   Health economics: fraudulent claim analytics
   Internet security: web log analytics
   Climatology: numeric weather forecasting and storm path prediction
   Nuclear energy: simulations of subatomic reactions, power from fusion
   Healthcare: individually based optimal treatment patterns
   Genetics: drug therapy, Human Genome Project
   Medicine: advanced imaging techniques
   Government operations: waste, fraud, abuse, optimal operations
10 To Hadoop or Not to Hadoop?
   SQL: strong ecosystem, rich tool set, large developer community, very efficient on structured data
   Hadoop: more versatile on unstructured data, cost-efficient, schema on read
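The contrast between a fixed SQL schema and "schema on read" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `events` table, and the function names `schema_on_write` and `schema_on_read` are hypothetical, and SQLite stands in for a relational database.

```python
import json
import sqlite3

# Hypothetical raw input: records do not all share the same shape.
RAW_EVENTS = [
    '{"user": "alice", "bytes": 120}',
    '{"user": "bob", "bytes": "n/a", "agent": "curl"}',
    'not json at all',
]


def schema_on_write(lines):
    """Validate and type each record before storing it in a fixed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (user TEXT NOT NULL, bytes INTEGER NOT NULL)"
    )
    rejected = 0
    for line in lines:
        try:
            rec = json.loads(line)
            row = (rec["user"], int(rec["bytes"]))
        except (ValueError, KeyError, TypeError):
            rejected += 1  # record does not fit the declared schema
            continue
        conn.execute("INSERT INTO events VALUES (?, ?)", row)
    total = conn.execute(
        "SELECT COALESCE(SUM(bytes), 0) FROM events"
    ).fetchone()[0]
    conn.close()
    return total, rejected


def schema_on_read(lines, field):
    """Keep raw lines untouched; interpret them only when a question is asked."""
    values = []
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # unparseable lines are skipped at query time, not at load
        if isinstance(rec, dict) and field in rec:
            values.append(rec[field])
    return values
```

With the fixed schema, only records that fit the table are stored, but queries on them are simple and fast. With schema on read, nothing is rejected at load time, so a field nobody planned for (here `agent`) can still be queried later.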
11 Hadoop is not Nirvana
   But there are many people using it despite the problems
   Data movement in/out of HDFS cumbersome
   Name Node failure
   Written in Java, performance issues
   3x data duplication wasteful
   Code base immature compared to SQL
   Management and admin not well developed
   Not a lot of Map/Reduce expertise
   Performance can be erratic, sub-standard
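The cost of 3x duplication is easy to put in numbers. The sketch below assumes a replication factor of 3, matching the slide, and ignores compression and temporary working space. The function names are hypothetical.

```python
def raw_capacity_needed(logical_tb, replication_factor=3):
    """Disk capacity (TB) required to hold logical_tb of data with replicas."""
    return logical_tb * replication_factor


def usable_capacity(raw_tb, replication_factor=3):
    """Logical data (TB) that fits on raw_tb of disk once replicas are counted."""
    return raw_tb / replication_factor
```

For example, `raw_capacity_needed(100)` shows that 100 TB of data occupies 300 TB of disk, and `usable_capacity(300)` shows that a 300 TB cluster holds only 100 TB of distinct data.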
12 ETL: The Hadoop Killer App?
   Data processing on Hadoop
      Data volume or lack of structure overwhelms ETL tools
      Use Hadoop to process transformations on raw data
      Load summarized data into the analytical database (GPDB)
   Leverage the best of RDBMS & NoSQL
      Integration of structured & unstructured data
   Tackle petabyte-scale datasets
      Sensor networks, social applications, online advertising apps
   New data for ad hoc analysis & modeling
      Social media sentiment analysis, online advertising optimization, computer security
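The "transform raw data, load summaries" pattern above can be sketched in map/reduce style with plain Python. This is a single-process illustration, not Hadoop code: the log format, the sample lines, and the names `map_line`, `reduce_counts`, `summarize`, and `to_csv` are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw web-log lines: timestamp, client IP, method, path, status, bytes.
RAW_LOG_LINES = [
    "2012-03-01T10:00:01 10.0.0.1 GET /index.html 200 512",
    "2012-03-01T10:00:02 10.0.0.2 GET /missing 404 0",
    "garbled line",
    "2012-03-02T09:15:00 10.0.0.1 GET /index.html 200 256",
]


def map_line(line):
    """Map step: emit ((day, status), bytes) for each well-formed line."""
    parts = line.split()
    if len(parts) != 6:
        return  # drop lines that do not match the assumed format
    timestamp, _ip, _method, _path, status, size = parts
    try:
        size_bytes = int(size)
    except ValueError:
        return
    yield (timestamp[:10], status), size_bytes


def reduce_counts(pairs):
    """Reduce step: total the hit count and byte count per key."""
    totals = defaultdict(lambda: [0, 0])
    for key, size_bytes in pairs:
        totals[key][0] += 1
        totals[key][1] += size_bytes
    return totals


def summarize(lines):
    """Run map then reduce, returning sorted (day, status, hits, bytes) rows."""
    pairs = (pair for line in lines for pair in map_line(line))
    totals = reduce_counts(pairs)
    return [
        (day, status, hits, size_bytes)
        for (day, status), (hits, size_bytes) in sorted(totals.items())
    ]


def to_csv(rows):
    """Serialize summary rows as CSV text, ready for a database bulk load."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["day", "status", "hits", "bytes"])
    writer.writerows(rows)
    return buf.getvalue()
```

Only the small summary produced by `to_csv(summarize(RAW_LOG_LINES))` would go to the analytical database, for instance through a bulk-load command such as `COPY`. The raw lines, including the ones that fail to parse, stay on the Hadoop side.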
13 Hybrid Solutions: Using SQL and Hadoop in a single application
   [Diagram labels: Raw Data (Relational, Text, Video, Audio, Logfiles); Hadoop Cluster; Interesting Stuff; Greenplum MPP Database; Archive]
14 How To Get Started
   Small, manageable first project
   Find a first problem that is important, but not mission critical.
   Show success, ROI.
   Take an existing application that is too slow or not answering questions.
   Involve LOB users from the beginning.
   Set reasonable expectations.
   Avoid extensive coding and development for the first project.
   Don't boil the ocean.
   Hire outside expertise; train your staff.