DST4L Class Notes: April 4, 2013 Presenter: David Dietrich




David Dietrich
Advisory Technical Consultant, Data Science, EMC Education Services
Twitter: @imdaviddietrich
Blog: http://infocus.emc.com/author/david_dietrich/

Kaggle
http://www.kaggle.com/
Data science is a sport
They go to organizations, uncover thorny data problems and crowdsource them
Competitions, badges

Examples: NYTE
http://senseable.mit.edu/nyte/visuals.html
New York Talk Exchange (NYTE)
Telecommunications data from AT&T
1) Social graph: who calling whom, when, where
  In NY, broken down at block level
  Could look at bibliometrics, library data in this way: what are the interactions? Who's recommending what things to what kinds of people?
2) Calling hotspots around the globe
  With this information, AT&T can make sure they have enough staffing in the hubs in the world, add cell towers, etc.

Tips
Try to tell a story about the data
Match the right graph to the kind of data you've got (worst: dense table)
Be thoughtful about how you choose to portray it

Example: Spread of Ideas
Using Social Graphs to Map the Spread of Innovation Ideas (see slides for visualization)
What are the relationships among winners and finalists? Between ideas and people submitting?
Visualization created in R using igraph, ggplot
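
To make the NYTE ideas above concrete, here is a minimal sketch of how "who calling whom" records might be turned into a weighted social graph and a list of calling hotspots. It uses plain Python rather than the R/igraph tooling mentioned in the talk, and the call records, field layout, and function names are all invented for illustration; the actual AT&T data and the NYTE method are not described in these notes.

```python
from collections import Counter

# Hypothetical call records: (caller, callee, caller_city).
# These values are invented for illustration only.
calls = [
    ("alice", "bob", "New York"),
    ("alice", "bob", "New York"),
    ("bob", "carol", "London"),
    ("carol", "alice", "New York"),
    ("dave", "alice", "Mumbai"),
]


def build_social_graph(records):
    """Count calls per undirected pair of people (edge weight = number of calls)."""
    edges = Counter()
    for caller, callee, _city in records:
        # Sort the pair so that A->B and B->A land on the same edge.
        edges[tuple(sorted((caller, callee)))] += 1
    return edges


def calling_hotspots(records):
    """Count how many calls originate in each location."""
    return Counter(city for _caller, _callee, city in records)


graph = build_social_graph(calls)
hotspots = calling_hotspots(calls)

print(graph.most_common(1))     # strongest tie in the graph
print(hotspots.most_common(1))  # busiest originating location
```

The same edge-counting pattern would apply to the library case raised above (who recommends what to whom), with people and items as the two ends of each edge.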

Bibliometrics
What are people doing with bibliometrics?
Network citation analysis
Flow of citations and research
Making predictions about what's a good/weak paper based on networks
Trying to trace flow of ideas
ngrams, text strings: how does this span literature in different fields?

Example: Healthcare
1) Problem: how to distribute vaccines for pandemics
Search tweets to find potential patients
Identify infection patterns, make maps
What are the patterns, changes?
Healthmap.org: http://healthmap.org/en/
Name for this: infodemiology

Example: Telecom
2) Problem: churn (marketing term meaning turnover/attrition)
Pretty easy to switch providers
Unhappy customers complain about quality of service
Companies typically run regression analysis to find out how likely people are to churn
Approach with big data: analyze call history data, treat call history as a social network
Complaining on social media: churn chattering
Knowing two customers' calling networks could have prevented 5 more from leaving
High risk cell phone churners (customers at center of a big social network) can be identified automatically in 1 hour
Solutions: can make high risk churners a priority when they call, make it attractive for them to stay, try to keep them from leaving

Example: Financial Services
3) Typical problem in loan processing: how to underwrite loans
Publicly available data can help make decisions: Zillow, census data, localized job market trends, geographical hazard risk, historical loan data, professional and social history of applicant
Continually surprised how far people take this last part, e.g., in middle of a loan, Facebook status changes from married to single: is this an issue?
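
Returning to the telecom churn example above: the idea of treating call history as a social network can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The call pairs, the churned set, and the scoring rule (rank remaining customers by how many of their contacts have already left, then by how many contacts they have) are all hypothetical, and the notes do not say how any real provider scores churn risk.

```python
from collections import defaultdict

# Hypothetical data, invented for illustration.
call_pairs = [
    ("ann", "ben"), ("ann", "cat"), ("ann", "dan"),
    ("ben", "cat"), ("dan", "eve"), ("eve", "fay"),
]
churned = {"ben", "dan"}  # customers who have already left


def build_neighbors(pairs):
    """Map each customer to the set of people they exchange calls with."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def churn_exposure(neighbors, churned):
    """For each remaining customer: (contacts who churned, total contacts)."""
    return {
        customer: (len(contacts & churned), len(contacts))
        for customer, contacts in neighbors.items()
        if customer not in churned
    }


def rank_at_risk(exposure):
    """Assumed rule: most churned contacts first, then most connected, then by name."""
    return sorted(exposure, key=lambda c: (-exposure[c][0], -exposure[c][1], c))


neighbors = build_neighbors(call_pairs)
exposure = churn_exposure(neighbors, churned)
ranking = rank_at_risk(exposure)
print(ranking)
```

A customer near the top of such a ranking is both well connected and already surrounded by churners, which is the kind of caller the notes suggest prioritizing when they contact support.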

Privacy
What Dietrich struggles with personally is privacy aspects
Amazon can track everything about the buying decision you're making, including how long you look at something
People he knows think you're paying for Gmail with your privacy
Started Facebook account but doesn't use it because they treat your data as their asset
But it's not all sinister

Class Discussion
Participant: Can people game the system?
Dietrich: People try, but it's like an arms race
Participant: Would changing to an opt-in model have a big impact on big data?
Dietrich: It's a trade-off: there are a lot of things recommendations do well
For now, look at browser called Tor, which will cloak your location: https://www.torproject.org/
See also Collusion (Chris's recommendation): http://www.mozilla.org/en-US/collusion/
And Ghostery: http://www.ghostery.com/
Participant: Some people use social media only for professional profile
Fine line between convenient and scary
Google wants you to take Google around with you (Glass)
Participant: Concerned about data taken out of context being used to inform decisions that affect people's quality of life (e.g., loan processing decisions influenced by amount of crime in a homebuyer's area)
Dietrich: Can be dangerous
Need people with content knowledge

What Constitutes Data?
Now anything is fair game to be called data in big data
New sources of data
What's driving this data growth: mobile sensors, surveillance, genomic sequencing, social media
People expect to analyze huge amounts of data quickly
Requires new platforms, roles, techniques

Example: Genetic Testing
23andMe: https://www.23andme.com/
Discover lineage, chance of going bald, likelihood of contracting disease, likely length of life
Now partnered with Ancestry.com

Big Data
Definition of big data: datasets so large they break traditional IT infrastructures

Structured/Unstructured Data
Methods in place for working with structured data
Focus moving to quasi- and unstructured data
Structured data: relational databases
Semi-structured: XML
Quasi-structured: click-streams, not such regular tags, more work to parse and impose structure

Unstructured: e.g., poetry with no punctuation (most of growth, most work to be done here)

New Ecosystem Around Big Data
Data devices: creating data through sensors or through humans interacting with them
Data collectors: government, hospitals, retail
Data aggregators: infochimps (crawls webs, aggregates datasets, dozens of datasets, some free, some for fee)
Data users/buyers

Early Adopters
Retail far out in front, masters in this stuff
Same for financial services
Government has done quite a lot
Everyone doing it and using more and more sophisticated tools
Universe of things you can do with it has grown
Now can't think of an industry not doing this

Drivers
1) Optimize business operations
2) Identify business risk
3) Predict new business opportunities
4) Comply with laws or regulatory requirements (how to comply, how to demonstrate they've complied)

Business Intelligence vs. Data Science
Business intelligence: creating data cubes, rapid querying of very structured datasets, reporting, dashboards, lots of queries
Data science: data mining, data analytics, predictive modeling, forecasting; data can be anything, including unstructured/mixed

Building a Data Science Team
Data science is a team sport
Need diversity of skills to solve problems well
May not need seven people; may need 3, may need 50
Business intelligence analysts: generally know the data really well
Database administrator: set up and configure database but may not be good at working with data
Data engineer: complex queries, SQL; good data engineer very hard to find and valuable
Data scientist: creative ways to solve problems; might not be great at or like engineering work

Data Analytics Lifecycle
Big temptation to jump to model building; end up sliding back
Discovery: clear problem definition, understand stakeholders, create hypotheses

Data prep: condition data, evaluate quality, is it normalized? This is where you'll spend lion's share of time, at least 80%
If you can munge and massage data, the universe of data you can work with explodes and makes these projects much better
Model planning: is this a clustering/classification/etc. problem?
Communicate wins, let others blow holes in it as a way to improve
Operationalize: real-time logic on analytical engines

Data Sources for Analytic Projects
Organizations are used to dealing with same systems, databases, tools
Think broadly, open-mindedly: in an ideal world, how would you solve the problem?
A lot of things in the wild you can get your hands on to make your analysis much better

Tools and Technologies
You want to have a lot of tools (methods, technologies) in your bag
Diverse data: need to attack it in a lot of ways
R hugely popular among data scientists on Kaggle
Python also tremendously popular, one of most versatile because people are building ecosystems around Python
After R, Matlab next most popular on Kaggle list
Excel way at bottom
NoSQL (Not Only SQL): newer database architectures
MongoDB used by RecordedFuture.com

Where This Is Going
Embedding analytical intelligence into computing: advancing what's possible
Simplifying big data: all money being made is here
Don't see demand going away; think it's going to grow

McKinsey Report
McKinsey Global Institute report: Big Data: The Next Frontier for Innovation, Competition, and Productivity, May 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

Industry Hiring Needs
Deep analytical talent (data scientists): projected US talent gap: 140K-190K (applied math, economics, life science)
Data-savvy professionals: projected US talent gap: 1.5 mil (at least one class in statistics)
People who know how to think about data
Technology & data enablers: already there
Numbers above now considered way too small
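
Looking back at the data prep phase of the lifecycle, and at the earlier remark that quasi-structured click-streams take more work to parse and impose structure, the following Python sketch shows what that munging step might look like on a tiny scale. The log format ("timestamp user_id url"), the sample lines, and the field names are assumptions made up for this illustration; real click-stream data will differ.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical raw click-stream lines in the assumed form "timestamp user_id url".
raw_clicks = [
    "2013-04-04T10:00:01 u1 http://example.com/search?q=big+data&page=2",
    "2013-04-04T10:00:05 u2 http://example.com/product/42",
    "malformed line",
    "2013-04-04T10:00:09 u1 http://example.com/search?q=churn",
]


def parse_click(line):
    """Turn one raw line into a structured row, or None if it cannot be parsed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    timestamp, user, url = parts
    split = urlsplit(url)
    if not split.scheme or not split.netloc:
        return None
    query = parse_qs(split.query)
    return {
        "timestamp": timestamp,
        "user": user,
        "path": split.path,
        # Search term if the URL carries one, otherwise None.
        "search_term": query.get("q", [None])[0],
    }


def prepare(lines):
    """Condition the data: keep parseable rows and count rejects as a quality check."""
    rows, rejected = [], 0
    for line in lines:
        row = parse_click(line)
        if row is None:
            rejected += 1
        else:
            rows.append(row)
    return rows, rejected


rows, rejected = prepare(raw_clicks)
print(len(rows), "rows kept;", rejected, "rejected")
```

Tracking the reject count alongside the cleaned rows is one simple way to "evaluate quality" before any model planning starts.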

Profile of a Data Scientist
Quantitative
Technical: knows more stats than engineers and more engineering than statisticians; can write algorithms
Skeptical: Is the solution viable?
Communicative, collaborative
Curious & creative: this is the most pivotal thing and the hardest thing to teach; someone who can ask really good questions

Specific Data Science Skills and Traits
Necessary qualification identified by DJ Patil, who started data scientist team at LinkedIn: would I be willing to go into a startup with you?
Self-motivated
Not afraid of learning math and new technologies
Knowledge of at least one domain area
Find ways to apply data science methods in their current roles

Formal Training
EMC Data Science & Big Data Analytics course: https://education.emc.com/guest/campaign/data_science.aspx
STEM (science, technology, engineering, math) graduate programs and certificates
Conferences on analytics (often post content online): Strata, PAW, ACM, ACL, INFORMS
Massive Open Online Courses (MOOCs)

Informal Training
Look for opportunities to try out your skills
Offer to help on projects
Leverage wisdom of crowds: social media, meetups
Volunteer to help: Datakind.org
Try contests: Kaggle.com, Innocentive.com

Applying This to the Library Domain
Look for opportunities to drive new value as a data scientist/data-savvy librarian
What do you want to do?
Map the flow of ideas in research literature?
Use citation networks to identify the most influential researcher?
Predict award-winning research papers? This is partly based on citation mapping & social network techniques
Increase collaboration with researchers and faculty?
Challenge traditional thinking using analytics?
Chris mentioned https://republicofletters.stanford.edu/
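
As one possible starting point for the citation-network question above, the sketch below counts incoming citations per author in Python. Treating raw citation counts (with self-citations skipped) as "influence" is a deliberately crude assumption, and the papers, authors, and citation pairs are invented for illustration; a real study would likely use richer network measures.

```python
from collections import Counter

# Hypothetical data, invented for illustration.
paper_authors = {
    "p1": "Garcia", "p2": "Chen", "p3": "Garcia", "p4": "Okafor", "p5": "Chen",
}
# Each pair is (citing paper, cited paper).
citations = [
    ("p2", "p1"), ("p3", "p1"), ("p4", "p1"),
    ("p4", "p2"), ("p5", "p3"), ("p5", "p4"),
]


def citations_per_author(citations, paper_authors, skip_self=True):
    """Count citations received by each author, optionally ignoring self-citations."""
    counts = Counter()
    for citing, cited in citations:
        if skip_self and paper_authors[citing] == paper_authors[cited]:
            continue
        counts[paper_authors[cited]] += 1
    return counts


counts = citations_per_author(citations, paper_authors)
print(counts.most_common(1))  # author with the most incoming citations
```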

Recommended Reading
Kahneman, Daniel. Thinking, Fast and Slow. http://www.amazon.com/thinking-Fast-and-Slow-ebook/dp/b004r1q2eg
Barabási, Albert-László. Linked: How Everything Is Connected to Everything Else and What It Means. http://www.amazon.com/linked-Everything-Connected-Else-Means/dp/0452284392
  Interesting read and very readable
  Many examples of network science and its evolution
David's blog on data science and big data analytics: http://infocus.emc.com/author/david_dietrich/
Blog on applying data analytics lifecycle to measuring innovation data: http://stevetodd.typepad.com/my_weblog/data-science-and-big-data-curriculum/
EMC Education Services curriculum on big data: https://education.emc.com/guest/campaign/data_science.aspx
Berns, Gregory. Iconoclast. http://www.amazon.com/iconoclast-Neuroscientist-Reveals-Think-Differently/dp/1422133303
  Attributes of visionaries:
  Unique perception: see problems in new ways
  Social intelligence and awareness
  No fear of failure: willing to try and take risks
  If you don't have these attributes, how do you try to cultivate them?
  Stimulate through novelty: jar yourself out of routines and fast thinking into deliberative thinking
Duhigg, Charles. The Power of Habit: Why We Do What We Do in Life and Business. http://www.amazon.com/power-Habit-What-Life-Business/dp/1400069289