A RoadMap to Data Science
Dr. Geoffrey Malafsky, CEO, Phasic Systems Inc.
www.phasicsystemsinc.com
703-945-1378
2 About the Speaker
- Geoffrey Malafsky, Ph.D., Founder and CEO, former scientist
- Nanotechnology researcher (Naval Research Laboratory)
- Technology advisor and sleuth
  - DARPA MEMS
  - Situational Awareness via real-time information fusion
  - Office of Naval Research MEMS Littoral sensors
  - Dept of Energy: Nanotechnology dual use
- Applying science to difficult data challenges as consultant, analyst, system developer
3 What is Data Science?
- Latest in a long line of hot IT topics
  - IT follows Neil Young: "It is better to burn out than it is to rust"
- Data Science is different from past IT hot spots
  - Science binds it to a well-structured culture, procedures, and ethics
  - Science is fundamentally rigorous in maintaining an auditable, open lineage of data collection, data rationalization, data analysis, theory comparison, adjudication of possible scenarios, and conclusions
- Data Science is not analytics, Business Intelligence, warehouse design, Big Data, Cloud whatever, Hadoop, ...
4 Big or Small Data: It Is the Quality That Counts
- Social media analysis: "Big Beautiful Data: See Our Social Exchange from Twitter to CNN", Kristina Farrah, 2 April 2012, http://siliconangle.com/blog/2012/04/02/big-beautiful-data-see-our-social-exchange-from-twitter-to-cnn/
5 Data Science As A Form of Science
- Study scientific method (Encyclopædia Britannica, Inc.): "mathematical and experimental techniques employed in the natural sciences; more specifically, techniques used in the construction and testing of scientific hypotheses. Many empirical sciences, especially the social sciences, use mathematical tools borrowed from probability theory and statistics, together with such outgrowths of these as decision theory, game theory, utility theory, and operations research. Philosophers of science have addressed general methodological problems, such as the nature of scientific explanation and the justification of induction."
6 Data Science From A Practitioner
- Mike Loukides, "What is Data Science?", 2 June 2010, http://radar.oreilly.com/
- "Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: here's a lot of data, what can you make from it?"
7 Our Data Science Principles
- Data Science is the field applying the scientific method to data collection, management, analysis, and reporting as a single integrated environment for general business purposes
- Rely on well-known and practiced methods of data collection, correction, integration, pedigree tracking, quality assurance, statistical analysis, model design and testing, tabular and graphical presentation, and visible traceability of conclusions through all analysis and conclusion steps
- Embrace uncertainty and transparency
8 Data Science Roadmap
- Understand what it is and is not (ignore the cacophony of charlatans and certificate mills)
- Identify high-value insights (note: not BI or reports) that your C-level executives want and can turn into action
  - This makes Data Science applied instead of basic
- Start small; plan a big win; find a senior management champion; don't wait for organizational clearance (they are waiting for you to succeed or fail first); be prepared for significant resistance and civil disobedience (work around)
- Continuously communicate that the win is a win for everyone and no one has to give up control
- Package results in extremely pretty and informative visualizations (see Tufte for some of the best)
9 Foundations
- Data collection
  - Multi-source: warehouse, external structured sets, unstructured high volume (email, social media), images, sensors, metadata
  - Multi-format
  - Raw versus refined and corrected
- Data rationalization
  - Continuous cleaning, correcting, aligning, adjudicating
  - Little errors grow exponentially; little garbage in → large garbage out
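As an illustration of "raw versus refined and corrected" with continuous correcting and adjudicating, here is a minimal Python sketch. It is not the speaker's method: the `Record` type, the `STATE_ALIASES` table, the source names, and the sample values are all hypothetical. The idea shown is only that the raw value is never overwritten, every correction is logged, and values that cannot be resolved are flagged for adjudication instead of being guessed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical alias table; a real one would come from the business vocabulary.
STATE_ALIASES = {"tx": "TX", "texas": "TX", "ky": "KY", "kentucky": "KY"}


@dataclass
class Record:
    source: str                      # where the value came from (pedigree)
    raw_state: str                   # value exactly as received, never overwritten
    state: Optional[str] = None      # refined value, filled in by rationalize()
    notes: List[str] = field(default_factory=list)  # audit trail of changes


def rationalize(record: Record) -> Record:
    """Fill in the refined state code and log what was changed and why."""
    key = record.raw_state.strip().lower()
    if key in STATE_ALIASES:
        record.state = STATE_ALIASES[key]
        if record.state != record.raw_state:
            record.notes.append(
                f"state: {record.raw_state!r} -> {record.state!r}"
            )
    else:
        # Do not guess: leave the refined value empty and flag it.
        record.notes.append(
            f"state: {record.raw_state!r} unresolved, needs adjudication"
        )
    return record


# Hypothetical inputs from three kinds of source.
records = [
    Record("warehouse", "TX"),
    Record("external_feed", " texas "),
    Record("email_extract", "Tex."),
]
cleaned = [rationalize(r) for r in records]

for r in cleaned:
    print(r.source, repr(r.raw_state), r.state, r.notes)
```

Keeping the raw value and the notes next to the refined value is one simple way to preserve the open lineage the earlier slides ask for.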
10 Foundations
- Data analysis
  - Multi-technique: statistics, models, graphical, linear/non-linear equations
  - Understand the scope, limits, and biases of each technique, especially statistics (be skeptical)
- Making conclusions
  - You will likely be wrong 80% of the time; this is a good thing
  - Keep it to yourself until you challenge, probe, rebuke, debunk
  - Make sure you can support every contention you make with hard facts and figures, or clear, valid analysis steps
- Presenting results
  - Show the main results as simply as possible
  - Keep the interesting (to you) results and analyses as backup
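One way to practice the skepticism about statistics described above is to compare summary measures and attach an uncertainty range to them. The following Python sketch uses only the standard library. The data values are made up for illustration, and the percentile bootstrap is one common technique chosen here as an example, not something the slides prescribe.

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[int((1.0 - tail) * n_resamples) - 1]
    return lo, hi


# Made-up sample with one extreme value.
data = [10, 11, 9, 10, 12, 10, 95]

mean = statistics.fmean(data)     # about 22.4, pulled up by the single 95
median = statistics.median(data)  # 10, unaffected by the single 95
lo, hi = bootstrap_interval(data)

print(f"mean={mean:.1f} median={median} interval=({lo:.1f}, {hi:.1f})")
```

A large gap between the mean and the median, or a wide interval, is a prompt to challenge and probe the data before presenting any conclusion.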
11 Focus on Data Rationalization
- Most data environments are badly misaligned, with semantically unknown relationships and value conflicts
- There will never be perfect data, but you cannot even start doing analysis until you control your data and understand the good, bad, and untrusted
- Data Rationalization is the process of building and managing a continuously adaptive data environment that fuels current and future business needs for decision making and system operations
14 The Ψ KORS System Model Point-select data models, codes, entities
15 Corporate NoSQL
16 Different Meanings (Legal and Business Activities)
NKY, HomeSeekers, Texas
Example solution:
1. Create table title aligned to business = Garage
2. Create vocabulary for distinct use cases (system, value analysis, business use) = (spaces, spaces.description, spaces.national, spaces.state, listingservice, ...)
3. Define ETL logic
4. Merge in warehouse and process in virtualization layer
5. Change as needed
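The example solution above could be sketched in Python as follows. This is only an illustration under stated assumptions: it treats NKY, HomeSeekers, and Texas as three listing sources whose "Garage" field means different things, and the raw value formats, function names, and `ETL_RULES` table are hypothetical. Only the vocabulary keys `spaces`, `spaces.description`, and `listingservice` come from the slide.

```python
import re


def from_nky(raw):
    # Assumption: NKY stores the number of garage spaces as text, e.g. "2".
    return {"spaces": int(raw), "spaces.description": None}


def from_homeseekers(raw):
    # Assumption: HomeSeekers stores free text, e.g. "Attached, 2 car".
    match = re.search(r"(\d+)\s*car", raw, flags=re.IGNORECASE)
    return {
        "spaces": int(match.group(1)) if match else None,
        "spaces.description": raw.strip(),
    }


def from_texas(raw):
    # Assumption: Texas stores a Y/N flag. "Y" carries no count,
    # so the number of spaces stays unknown (None) instead of being guessed.
    has_garage = raw.strip().upper() == "Y"
    return {
        "spaces": None if has_garage else 0,
        "spaces.description": "garage present" if has_garage else "no garage",
    }


# Step 3: ETL logic, one rule per source (hypothetical table).
ETL_RULES = {
    "NKY": from_nky,
    "HomeSeekers": from_homeseekers,
    "Texas": from_texas,
}


def to_garage(source, raw):
    """Map one source-specific value into the shared Garage vocabulary."""
    row = ETL_RULES[source](raw)
    row["listingservice"] = source  # keep the pedigree of each row
    return row


# Step 4: merge rows from all sources into one Garage table.
merged = [
    to_garage("NKY", "2"),
    to_garage("HomeSeekers", "Attached, 2 car"),
    to_garage("Texas", "Y"),
]
for row in merged:
    print(row)
```

Because each source has its own rule in one table, step 5 ("change as needed") amounts to editing or adding a single entry.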
17 Summary
- Data Science is new and exciting
- It is an excellent career opportunity for explorers with discipline and a continuous zeal for investigation and uncovering important new insights
- To succeed, the result must be important to a senior decision maker
- Get a champion at the beginning by making the business case of a big win for a small investment
- Expect resistance and work to turn nay into yay with constant "no one loses" communication
- Use clear, concise, attractive graphics to get people excited
18 More
- Look for an in-depth learning webinar on Data Science and Data Rationalization
- The new PSI-KORS Foundation will promulgate noncommercial use of the Ψ KORS metamodel and Corporate NoSQL
- Contact us to bring success into your career and organization