DST4L Class Notes: April 4, 2013

Presenter: David Dietrich
Advisory Technical Consultant, Data Science, EMC Education Services
Twitter: @imdaviddietrich
Blog: http://infocus.emc.com/author/david_dietrich/

Kaggle
http://www.kaggle.com/
Data science is a sport
They go to organizations, uncover thorny data problems, and crowdsource them
Competitions, badges

Examples: NYTE
http://senseable.mit.edu/nyte/visuals.html
New York Talk Exchange (NYTE)
Telecommunications data from AT&T
1) Social graph: who is calling whom, when, where
  In NY, broken down at block level
  Could look at bibliometrics, library data in this way: what are the interactions? Who's recommending what things to what kinds of people?
2) Calling hotspots around the globe
  With this information, AT&T can make sure they have enough staffing in the hubs in the world, add cell towers, etc.

Tips
Try to tell a story about the data
Match the right graph to the kind of data you've got (worst: dense table)
Be thoughtful about how you choose to portray it

Example: Spread of Ideas
Using Social Graphs to Map the Spread of Innovation Ideas (see slides for visualization)
What are the relationships among winners and finalists? Between ideas and people submitting?
Visualization created in R using igraph, ggplot

Bibliometrics
What are people doing with bibliometrics?
Network citation analysis
Flow of citations and research
Making predictions about what's a good/weak paper based on networks
Trying to trace flow of ideas
ngrams, text strings: how does this span literature in different fields?

Example: Healthcare
1) Problem: how to distribute vaccines for pandemics
Search tweets to find potential patients
Identify infection patterns, make maps
What are the patterns, changes?
Healthmap.org: http://healthmap.org/en/
Name for this: infodemiology

Example: Telecom
2) Problem: churn (marketing term meaning turnover/attrition)
Pretty easy to switch providers
Unhappy customers complain about quality of service
Companies typically run regression analysis to find out how likely people are to churn
Approach with big data: analyze call history data, treat call history as a social network
Complaining on social media: "churn chattering"
Knowing two customers' calling networks could have prevented 5 more from leaving
High-risk cell phone churners (customers at the center of a big social network) can be identified automatically in 1 hour
Solutions: can make high-risk churners a priority when they call, make it attractive for them to stay, try to keep them from leaving

Example: Financial Services
3) Typical problem in loan processing: how to underwrite loans
Publicly available data can help make decisions: Zillow, census data, localized job market trends, geographical hazard risk, historical loan data, professional and social history of applicant
Continually surprised how far people take this last part, e.g., in the middle of a loan, Facebook status changes from married to single: is this an issue?
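
The Telecom example above treats call history as a social network and looks for customers at the center of it. A minimal sketch of that idea follows, assuming that "center" can be approximated by degree centrality (the share of other customers someone is directly connected to). The call records, names, and function names are hypothetical and invented for illustration; the talk did not specify which measure or tools were used.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs.
# These names are invented for illustration, not data from the talk.
CALLS = [
    ("ana", "ben"), ("ana", "cho"), ("ana", "dev"), ("ana", "eli"),
    ("ben", "cho"), ("dev", "eli"), ("fay", "gus"),
]


def build_call_graph(calls):
    """Return an undirected graph as {customer: set of contacts}."""
    graph = defaultdict(set)
    for caller, callee in calls:
        if caller != callee:  # ignore self-calls
            graph[caller].add(callee)
            graph[callee].add(caller)
    return dict(graph)


def degree_centrality(graph):
    """Share of all other customers that each customer is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(contacts) / (n - 1) for node, contacts in graph.items()}


def most_connected(graph, top_n=3):
    """Customers ranked by degree centrality, ties broken alphabetically."""
    scores = degree_centrality(graph)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


if __name__ == "__main__":
    for customer, score in most_connected(build_call_graph(CALLS)):
        print(f"{customer}: {score:.2f}")
```

In this toy data the customer "ana" is connected to the most other customers, so under the assumption above she would be the one to prioritize. Real churn models would likely combine such a score with other signals, such as complaints or usage history.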

Privacy
What Dietrich struggles with personally is the privacy aspects
Amazon can track everything about the buying decision you're making, including how long you look at something
People he knows think you're paying for Gmail with your privacy
Started a Facebook account but doesn't use it because they treat your data as their asset
But it's not all sinister

Class Discussion
Participant: Can people game the system?
Dietrich: People try, but it's like an arms race
Participant: Would changing to an opt-in model have a big impact on big data?
Dietrich: It's a trade-off: there are a lot of things recommendations do well
For now, look at a browser called Tor, which will cloak your location: https://www.torproject.org/
See also Collusion (Chris's recommendation): http://www.mozilla.org/en-US/collusion/
And Ghostery: http://www.ghostery.com/
Participant: Some people use social media only for a professional profile
Fine line between convenient and scary
Google wants you to take Google around with you (Glass)
Participant: Concerned about data taken out of context being used to inform decisions that affect people's quality of life (e.g., loan processing decisions influenced by amount of crime in a homebuyer's area)
Dietrich: Can be dangerous
Need people with content knowledge

What Constitutes Data?
Now anything is fair game to be called "data" in big data
New sources of data
What's driving this data growth: mobile sensors, surveillance, genomic sequencing, social media
People expect to analyze huge amounts of data quickly
Requires new platforms, roles, techniques

Example: Genetic Testing
23andMe: https://www.23andme.com/
Discover lineage, chance of going bald, likelihood of contracting disease, likely length of life
Now partnered with Ancestry.com

Big Data
Definition of big data: datasets so large they break traditional IT infrastructures

Structured/Unstructured Data
Methods in place for working with structured data
Focus moving to quasi- and unstructured data
Structured data: relational databases
Semi-structured: XML
Quasi-structured: click-streams, not such regular tags, more work to parse and impose structure
Unstructured: e.g., poetry with no punctuation (most of growth, most work to be done here)

New Ecosystem Around Big Data
Data devices: creating data through sensors or through humans interacting with them
Data collectors: government, hospitals, retail
Data aggregators: infochimps (crawls the web, aggregates datasets; dozens of datasets, some free, some for a fee)
Data users/buyers

Early Adopters
Retail far out in front, masters in this stuff
Same for financial services
Government has done quite a lot
Everyone doing it and using more and more sophisticated tools
Universe of things you can do with it has grown
Now can't think of an industry not doing this

Drivers
1) Optimize business operations
2) Identify business risk
3) Predict new business opportunities
4) Comply with laws or regulatory requirements (how to comply, how to demonstrate they've complied)

Business Intelligence vs. Data Science
Business intelligence: creating data cubes, rapid querying of very structured datasets, reporting, dashboards, lots of queries
Data science: data mining, data analytics, predictive modeling, forecasting; data can be anything, including unstructured/mixed

Building a Data Science Team
Data science is a team sport
Need diversity of skills to solve problems well
May not need seven people: may need 3, may need 50
Business intelligence analysts: generally know the data really well
Database administrator: sets up and configures the database but may not be good at working with data
Data engineer: complex queries, SQL; a good data engineer is very hard to find and valuable
Data scientist: creative ways to solve problems; might not be great at or like engineering work

Data Analytics Lifecycle
Big temptation to jump to model building; end up sliding back
Discovery: clear problem definition, understand stakeholders, create hypotheses
Data prep: condition data, evaluate quality (is it normalized?); this is where you'll spend the lion's share of time, at least 80%
If you can munge and massage data, the universe of data you can work with explodes and makes these projects much better
Model planning: is this a clustering/classification/etc. problem?
Communicate wins; let others blow holes in it as a way to improve
Operationalize: real-time logic on analytical engines

Data Sources for Analytic Projects
Organizations are used to dealing with the same systems, databases, tools
Think broadly, open-mindedly: in an ideal world, how would you solve the problem?
A lot of things in the wild you can get your hands on to make your analysis much better

Tools and Technologies
You want to have a lot of tools (methods, technologies) in your bag
Diverse data: need to attack it in a lot of ways
R hugely popular among data scientists on Kaggle
Python also tremendously popular, one of the most versatile because people are building ecosystems around Python
After R, Matlab next most popular on Kaggle list
Excel way at bottom
NoSQL (Not Only SQL): newer database architectures
MongoDB used by RecordedFuture.com

Where This Is Going
Embedding analytical intelligence into computing: advancing what's possible
Simplifying big data: all the money being made is here
Don't see demand going away; think it's going to grow

McKinsey Report
McKinsey Global Institute report: Big Data: The Next Frontier for Innovation, Competition, and Productivity, May 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

Industry Hiring Needs
Deep analytical talent (data scientists): projected US talent gap: 140K-190K (applied math, economics, life science)
Data-savvy professionals: projected US talent gap: 1.5 mil (at least one class in statistics)
People who know how to think about data
Technology & data enablers: already there
Numbers above now considered way too small

Profile of a Data Scientist
Quantitative
Technical: knows more stats than engineers and more engineering than statisticians; can write algorithms
Skeptical: Is the solution viable?
Communicative, collaborative
Curious & creative: this is the most pivotal thing and the hardest thing to teach; someone who can ask really good questions

Specific Data Science Skills and Traits
Necessary qualification identified by DJ Patil, who started the data scientist team at LinkedIn: would I be willing to go into a startup with you?
Self-motivated
Not afraid of learning math and new technologies
Knowledge of at least one domain area
Find ways to apply data science methods in their current roles

Formal Training
EMC Data Science & Big Data Analytics course: https://education.emc.com/guest/campaign/data_science.aspx
STEM (science, technology, engineering, math) graduate programs and certificates
Conferences on analytics (often post content online): Strata, PAW, ACM, ACL, INFORMS
Massive Open Online Courses (MOOCs)

Informal Training
Look for opportunities to try out your skills
Offer to help on projects
Leverage wisdom of crowds: social media, meetups
Volunteer to help: Datakind.org
Try contests: Kaggle.com, Innocentive.com

Applying This to the Library Domain
Look for opportunities to drive new value as a data scientist/data-savvy librarian
What do you want to do?
Map the flow of ideas in research literature?
Use citation networks to identify the most influential researcher?
Predict award-winning research papers? This is partly based on citation mapping & social network techniques
Increase collaboration with researchers and faculty?
Challenge traditional thinking using analytics?
Chris mentioned https://republicofletters.stanford.edu/

Recommended Reading
Kahneman, Daniel. Thinking, Fast and Slow.
http://www.amazon.com/thinking-Fast-and-Slow-ebook/dp/b004r1q2eg

Barabási, Albert-László. Linked: How Everything Is Connected to Everything Else and What It Means.
http://www.amazon.com/linked-Everything-Connected-Else-Means/dp/0452284392
Interesting read and very readable
Many examples of network science and its evolution

David's blog on data science and big data analytics
http://infocus.emc.com/author/david_dietrich/

Blog on applying data analytics lifecycle to measuring innovation data
http://stevetodd.typepad.com/my_weblog/data-science-and-big-data-curriculum/

EMC Education Services curriculum on big data
https://education.emc.com/guest/campaign/data_science.aspx

Berns, Gregory. Iconoclast.
http://www.amazon.com/iconoclast-Neuroscientist-Reveals-Think-Differently/dp/1422133303
Attributes of visionaries:
Unique perception: see problems in new ways
Social intelligence and awareness
No fear of failure: willing to try and take risks
If you don't have these attributes, how do you try to cultivate them?
Stimulate through novelty: jar yourself out of routines and fast thinking into deliberative thinking

Duhigg, Charles. The Power of Habit: Why We Do What We Do in Life and Business.
http://www.amazon.com/power-Habit-What-Life-Business/dp/1400069289