The Lab and The Factory: Architecting for Big Data Management
April Reeve
DAMA Wisconsin, March 11, 2014
"A good speech should be like a woman's skirt: long enough to cover the subject and short enough to create interest." Winston Churchill
April Reeve
- Twenty-five years doing data-oriented stuff
- Data Management disciplines: Data Integration, Data Governance, Data Modeling, Data Quality, Business Intelligence, Master Data Management, Data Conversion, Data Warehousing, Enterprise Content Management, Big Data Management
- Currently implementing Data Governance programs and developing Big Data Strategies for Life Sciences and Financial Services organizations
- Certifications:
  - Certified Data Management Professional (DAMA)
  - Certified Data Governance and Stewardship Professional (DGSP)
  - Certified Business Intelligence Professional (CBIP)
  - Certified in Enterprise Governance of IT (ISACA)
  - Certified Information Systems Auditor (ISACA)
- Master's degree in Financial Management (financial risk management, derivatives pricing, corporate finance)
- Book: Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
Agenda
- Big Data
- The Data Scientist environment for predictive analytics: the Lab
- Operationalizing predictions: the Factory
- How does it fit with legacy data management architecture?
Analytics Maturity: From Data to Information on Demand
Smart big data strategies consider more than data volume: they also consider the velocity, variety, and complexity of information
[Figure: Volume, Velocity, Variety, and Complexity of data (documents, transactional data, smart grid, images, audio, text, video), leading to new insights on customers, products, and operations and to contextual and location-aware delivery to any device]
- Volume: data volumes approaching multiple petabytes
- Velocity: data being generated and ingested for analysis in real time
- Variety: tabular, documents, e-mail, metering, network, video, image, audio
- Complexity: different standards, domain rules, and storage formats per data type
Source: Gartner, March 2011
Big Data Goal: More, Faster, Better Data for Purpose
Area       | Revolution
Latency    | No time to read. In-memory is the new DB
Enrichment | Tagging is the new Transformation
Query      | Federated Query is the new ETL
Purpose    | Purposeful View is the new Master
Analytics  | Predictive is the new Reactive
Result     | Trigger Action is the new Decision Support
Predictive Analytics
- The Data Scientist chooses internal and external data (lots of it!) and throws it into an Analytical Sandbox
- The Data Scientist identifies patterns in the data and develops predictive models of behavior that combine historical information about a customer with real-time data flows
What is Data Science?
Data Science refers to the scientific method:
- The scientist (Data Scientist) develops a hypothesis (a model of behavior)
- Using a large amount of historical data and statistical analysis, the Data Scientist attempts to prove that the model is accurate for predicting behavior
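The hypothesis-then-test loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method from the talk: the data is synthetic, and the "support calls predict churn" hypothesis, the function names, and the 1500/500 split are all made up for the example. The point it shows is that the model is fitted on one part of the history and judged on a part it has never seen.

```python
import random

def make_history(n, seed=0):
    """Synthetic stand-in for historical customer records (not real data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        calls = rng.randint(0, 10)
        # Assumed ground truth: churn probability rises with support calls.
        churned = rng.random() < calls / 10
        rows.append((calls, churned))
    return rows

def accuracy(rows, threshold):
    """Share of rows where 'calls >= threshold' matches the observed outcome."""
    hits = sum((calls >= threshold) == churned for calls, churned in rows)
    return hits / len(rows)

def fit_threshold(rows):
    """Hypothesis: customers with at least `threshold` calls churn.
    Pick the threshold that best explains the training rows."""
    return max(range(0, 12), key=lambda t: accuracy(rows, t))

history = make_history(2000)
train, holdout = history[:1500], history[1500:]

threshold = fit_threshold(train)

# Baseline: always predict the more common outcome in the holdout set.
churners = sum(churned for _, churned in holdout)
baseline = max(churners, len(holdout) - churners) / len(holdout)

print(f"threshold={threshold}")
print(f"holdout accuracy={accuracy(holdout, threshold):.2f}, baseline={baseline:.2f}")
```

If the holdout accuracy does not beat the baseline, the hypothesis is not supported by the data and the model goes back to the Lab.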
Leveraging Big Data for Action: Predictive Analytics
Leveraging Big Data for Action
- The organization develops software that populates the models with historical customer information and installs it in the operational reporting environment
- Real-time processing combines customer information with a real-time data stream, which can trigger automatic processes and alerts
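A minimal Python sketch of that operational step follows: each incoming event is combined with what is already known about the customer, and an action fires as soon as the model flags it. Everything named here is hypothetical (the `Event` type, the `PROFILES` lookup, the "five times typical spend" rule, the policy for unknown customers). A production Factory would read from a real stream and a real customer store rather than in-memory lists.

```python
from dataclasses import dataclass

@dataclass
class Event:
    customer_id: str
    amount: float

# Hypothetical output of the Lab: typical spend per customer, learned from history.
PROFILES = {"c1": 40.0, "c2": 500.0}

def score(event, profiles, factor=5.0):
    """Return True when the event deviates strongly from the customer's history."""
    typical = profiles.get(event.customer_id)
    if typical is None:
        return True  # unknown customer: flag for review (assumed policy)
    return event.amount > factor * typical

def process(stream, profiles, on_alert):
    """Score each event as it arrives and trigger the action immediately."""
    for event in stream:
        if score(event, profiles):
            on_alert(event)

alerts = []
process(
    [Event("c1", 35.0), Event("c1", 900.0), Event("c2", 450.0)],
    PROFILES,
    alerts.append,  # stand-in for an automatic process or alert
)
print([(e.customer_id, e.amount) for e in alerts])  # prints [('c1', 900.0)]
```

Passing the action in as `on_alert` keeps the scoring logic separate from whatever process or alert the organization wants to trigger.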
Leveraging Big Data for Action: Streaming Data / Extreme Transaction Processing
Big Data Analytics Architecture
In Big Data management we need:
- A Lab or Sandbox environment that is very dynamic, where Data Scientists can throw in or throw away massive amounts of structured and unstructured data to analyze, find patterns and insights, and develop models
- An operational Information Factory with all the good production processes we've learned around data access security and high-volume efficiency, to produce insight and trigger action on an ongoing basis. This Factory also needs to process structured data, unstructured data, and data streams, and so requires a Big Data architecture that includes, among other things, relational and NoSQL databases, unstructured data stores, and in-memory databases, as well as the ability to process and trigger action.
New Data Hubs: The Analytical Sandbox & NoSQL Data Stores
[Figure labels: Structured BI Reporting Environment; ETL; DW; ALL data fed into Hadoop Data Store; Hadoop Data Store; Data Preparation and Enrichment; Exploratory Analytic Environment; Analytic Sandbox]
Data Latency Spectrum
Use Case                                             | Time Interval
Ultra low latency messaging                          | < 100 microseconds
Extreme transaction processing                       | < 1 millisecond
Streaming data analysis; no intermediate persistence | < 100 milliseconds
Real time event characterization                     | < 1 second
Complex event processing; near real-time dashboards  | < 30 seconds
Operational dashboard                                | < 5 minutes
Intraday analysis                                    | < 2 hours
Daily rollup                                         | 24 hours
Recent historical analysis                           | 8 days
Medium-term historical analysis                      | 13 months
Long-term historical analysis                        | 5 years
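One way to put the latency spectrum to work is as a lookup: given the latency a requirement can tolerate, find the class of use case it falls into. The Python sketch below is an illustration only. The function name is hypothetical, the last four rows are treated as upper bounds like the rest, and "13 months" and "5 years" are approximated as 13 × 30 days and 5 × 365 days.

```python
from bisect import bisect_right

# (upper bound in seconds, use case), transcribed from the latency spectrum.
# Assumption: every interval is read as an exclusive upper bound ("<").
TIERS = [
    (100e-6, "Ultra low latency messaging"),
    (1e-3, "Extreme transaction processing"),
    (100e-3, "Streaming data analysis; no intermediate persistence"),
    (1.0, "Real time event characterization"),
    (30.0, "Complex event processing; near real-time dashboards"),
    (5 * 60.0, "Operational dashboard"),
    (2 * 3600.0, "Intraday analysis"),
    (24 * 3600.0, "Daily rollup"),
    (8 * 86400.0, "Recent historical analysis"),
    (13 * 30 * 86400.0, "Medium-term historical analysis"),  # approximation
    (5 * 365 * 86400.0, "Long-term historical analysis"),    # approximation
]
BOUNDS = [bound for bound, _ in TIERS]

def use_case_for(latency_seconds):
    """Return the first use case whose bound is strictly above the latency."""
    i = bisect_right(BOUNDS, latency_seconds)
    if i == len(TIERS):
        return "Beyond the spectrum"
    return TIERS[i][1]

for seconds in (0.5, 45, 3600):
    print(seconds, "->", use_case_for(seconds))
```

A lookup like this can help decide which part of the architecture a requirement belongs to, for example an in-memory or streaming component versus a batch data store.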
Considerations in Organizing People
The Lab: In their search for new insights, data scientists write enormous quantities of code. But it is not designed to meet commercial standards for scalability, security, and stability. You create and support commercial-grade code in the factory.
The Factory: The [Factory] requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics. To be clear, creativity and experimentation are important in the factory, but you must not expect more than incremental thinking and production-oriented solutions.
From an article by Thomas C. Redman and Bill Sweeney in Harvard Business Review
Big Data Analytics Architecture
Contact Information
April Reeve
EMC Consulting, Enterprise Information Management Practice
April.Reeve@emc.com
+1 (201) 396-1831
@Datagrrl on Twitter
Blog: http://infocus.emc.com/april_reeve/
Book: Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
THANK YOU