Data Scientists at Work
Sexy Scientists Wrangling Data and Begetting New Industries
Sebastian Gutierrez
Foreword by Peter Norvig (Google)
Contents

Foreword by Peter Norvig, Google
About the Author
Acknowledgments
Introduction
Chapter 1: Chris Wiggins, The New York Times
Chapter 2: Caitlin Smallwood, Netflix
Chapter 3: Yann LeCun, Facebook
Chapter 4: Erin Shellman, Nordstrom
Chapter 5: Daniel Tunkelang, LinkedIn
Chapter 6: John Foreman, MailChimp
Chapter 7: Roger Ehrenberg, IA Ventures
Chapter 8: Claudia Perlich, Dstillery
Chapter 9: Jonathan Lenaghan, PlaceIQ
Chapter 10: Anna Smith, Rent the Runway
Chapter 11: André Karpištšenko, Planet OS
Chapter 12: Amy Heineike, Quid
Chapter 13: Victor Hu, Next Big Sound
Chapter 14: Kira Radinsky, SalesPredict
Chapter 15: Eric Jonas, Neuroscience Research
Chapter 16: Jake Porway, DataKind
Index
CHAPTER 1

Chris Wiggins
The New York Times

Chris Wiggins is the Chief Data Scientist at The New York Times (NYT) and Associate Professor of Applied Mathematics at Columbia University. He applies machine learning techniques in both roles, albeit to answer very different questions. In his role at the NYT, Wiggins is creating a machine learning group to analyze the content produced by reporters, the data generated by readers consuming articles, and data from broader reader navigational patterns, with the overarching goals of better listening to NYT consumers and rethinking what journalism is going to look like over the next 100 years.

At Columbia University, Wiggins focuses on the application of machine learning techniques to biological research with large data sets. This includes analysis of naturally occurring networks, statistical inference applied to biological time-series data, and large-scale sequence informatics in computational biology. As part of his work at Columbia, he is a founding member of the university's Institute for Data Sciences and Engineering (IDSE) and Department of Systems Biology.

Wiggins is also active in the broader New York tech community, as co-founder and co-organizer of hackNY, a nonprofit organization that guides and mentors the next generation of hackers and technologists in the New York innovation community.

Wiggins has held appointments as a Courant Instructor at the New York University Courant Institute of Mathematical Sciences and as a Visiting Research Scientist at the Institut Curie (Paris), Hahn-Meitner Institut (Berlin), and the Kavli Institute for Theoretical Physics (Santa Barbara). He holds a PhD in Physics from Princeton University and a BA in Physics from Columbia, minoring as an undergraduate in religion and in mathematics.

Wiggins's diverse accomplishments demonstrate how world-class data science skills wedded to extraordinarily strong values can enable an individual data scientist to make tremendous impacts in very different environments, from startups to centuries-old institutions. This combination of versatility and morality comes through as he describes his belief in a functioning press and his role inside of it, why he values people, ideas, and things in that order, and why caring and creativity are what he looks for in other people's work. Wiggins's passion for mentoring and advising future scientists and citizens across all of his roles is a leitmotif of his interview.

Gutierrez: Tell me about where you work.

Wiggins: I split my time between Columbia University, where I am an associate professor of applied mathematics, and The New York Times, where I am the chief data scientist. I could talk about each institution for a long time. As background, I have a long love for New York City. I came to New York to go to Columbia as an undergraduate in the 1980s. I think of Columbia University itself as this great experiment to see if you can foster an Ivy League education and a strong scientific and research community within the experiment of New York City, which is full of excitement and distraction and change and, most of all, full of humanity. Columbia University is a very exciting and dynamic place, full of very disruptive students and alumni, myself included, and has been for centuries.

The New York Times is also centuries old. It's a 163-year-old company, and I think it also stands for a set of values that I strongly believe in and is also very strongly associated with New York, which I like very much. When I think of The New York Times, I think of the sentiment expressed by Thomas Jefferson that if you could choose between a functioning democracy and a dysfunctional press, or a functioning press and a dysfunctional democracy, he would rather have the functioning press. You need a functioning press and a functioning journalistic culture to foster and ensure the survival of democracy.
I get the joy of working with three different companies whose missions I strongly value. The third company where I spend my time is a nonprofit that I cofounded many years ago, called hackNY. I remain very active as the co-organizer. In fact, tonight, we're going to have another hackNY lecture, and I'll have a meeting today with the hackNY general manager to deal with operations. So I really split my time among three companies, all of whose missions I value: The New York Times and the two nonprofits, Columbia University and hackNY.

Gutierrez: How does data science fit into your work?
Wiggins: I would say it's an exciting time to be working in data science, both in academia and at The New York Times. Data science is really being birthed as an academic field right now. You can find the intellectual roots of it in a proposal by the computational statistician Bill Cleveland in [...]. Clearly, you can also find roots for data scientists as such in job descriptions, the most celebrated examples being DJ Patil's at LinkedIn and Jeff Hammerbacher's at Facebook. However, in some ways, the intellectual roots go back to writings by the heretical statistician John Tukey in [...].

There's been something brewing in academia for half a century: a disconnect between statistics as an ever more mathematical field and the practical fact that the world is producing more and more data all the time, while computational power is exponentiating over time. More and more fields are interested in trying to learn from data.

My research over the last decade or more at Columbia has been in what we would now call data science: what I used to call machine learning applied to biology but now might call data science in the natural sciences. There the goal was to collaborate with people who have domain expertise (not even necessarily quantitative or mathematical domain expertise) that's been built over decades of engagement with real questions from problems in the workings of biology that are complex but certainly not random. The community grappling with these questions found itself increasingly overwhelmed with data. So there's an intellectual challenge there that is not exactly the intellectual challenge of machine learning. It's more the intellectual challenge of trying to use machine learning to answer questions from a real-world domain. And that's been exciting to work through in biology for a long time.
It's also exciting to be at The New York Times because The New York Times is one of the larger and more economically stable publishers, while defending democracy and historically setting a very high bar for journalistic integrity. They do that through decades and centuries of very strong vocal self-introspection. They're not afraid to question the principles, choices, or even the leadership within the organization, which I think creates a very healthy intellectual culture.

At the same time, though, although it's economically strong as a publisher, the business model of publishing for the last two centuries or so has completely evaporated just over the last 10 years; over 70 percent of print advertising revenue simply evaporated, most precipitously starting around [...]. So although this building is full of very smart people, it's undergoing a clear sea change in terms of how it will define the future of sustainable journalism.
The current leadership, all the way down to the reporters, who are the reason for existence of the company, is very curious about "the digital," broadly construed. And that means: How does journalism look when you divorce it from the medium of communication? Even the word "newspaper" presumes that there's going to be paper involved. And paper remains very important to The New York Times, not only in the way things are organized (even the way the daily schedule is organized here) but also conceptually.

At the same time, I think there are a lot of very forward-looking people here, both journalists and technologists, who are starting to diversify the way that The New York Times communicates the news. To do that, you are constantly doing experiments. And if you're doing experiments, you need to measure something. And the way you measure things right now, in 2014, is via the way people engage with their products.

So from web logs to every event when somebody interacts with the mobile app, there are copious, copious data available to this company to figure out: What is it that the readers want? What is it that they value? And, of course, that answer could be dynamic. It could be that what readers want in 2014 is very different than what they wanted in 2013 or [...]. So what we're trying to do in the Data Science group is to learn from and make sense of the abundant data that The New York Times gathers.

Gutierrez: When did you realize that you wanted to work with data as a career?

Wiggins: That happened one day at graduate school while having lunch with some other graduate students, mostly physicists working in biology. Another graduate student walked in brandishing the cover of Science magazine, which had an image of the genome of Haemophilus influenzae. Haemophilus influenzae is the first sequenced freely living organism. This is a pathogen that had been identified on the order of 100 years earlier.
But to sequence something means that you go from having pictures of it and maybe experiments where you pour something on it and maybe it turns blue, to having a phonebook's worth of information. That information unfortunately is written in a language that we did not choose, just a four-letter alphabet: imagine ACGT ACGT, over and over again. You can just picture a phonebook's worth of that. And there begins the question, which is both statistical and scientific: How do you make sense of this abundant information? We have this organism. We've studied it for 100 years. We know what it does, and now we're presented with this entirely different way of understanding this organism. In some ways, it's the entire manual for the pathogen, but it's written in a language that we didn't choose. That was a real turning point in biology.
When I started my PhD work in the early 1990s, I was working on the style of modeling that a physicist does, which is to look for simple problems where simple models can reveal insight. The relationship between physics and biology was growing but limited in character, because really the style of modeling of a physicist is usually about trying to identify a problem that is the key element, the key simplified description, which allows fundamental modeling. Suddenly dropping a phonebook on the table and saying, "Make sense of this," is a completely different way of understanding it. In some ways, it's the opposite of the kind of fundamental modeling that physicists revered. And that is when I started learning about learning.

Fortunately, physicists are also very good at moving into other fields. I had many culture brokers that I could go to in the form of other physicists who had bravely gone into, say, computational neuroscience or other fields where there was already a well-established relationship between the scientific domain and how to make sense of data. In fact, one of the preeminent conferences in machine learning is called NIPS, and the N is for "neuroscience." That was a community which even before genomics was already trying to do what we would now call "data science," which is to use data to answer scientific questions.

By the time I finished my PhD, in the late 1990s, I was really very interested in this growing literature of people asking statistical questions of biology. It's maddening to me not to be able to separate wheat from chaff. When I read these papers, the only way to really separate wheat from chaff is to start writing papers like that yourself and to try to figure out what's doable and what's not doable. Academia is sometimes slow to reveal what is wheat and what is chaff, but eventually it does a very good job.
There's a proliferation of papers and, after a couple of years, people realize which things were gold and which things were fool's gold. I think that now you have a very strong tradition of people using machine learning to answer scientific questions.

Gutierrez: What in your career are you most proud of?

Wiggins: I'm actually most proud of the mentoring component of what I do. I think I, and many other people who grow up in the guild system of academia, acquire a strong appreciation for the way we've all benefited from good mentoring. Also, I know what it's like both to be on the receiving end and the giving end of really bad and shallow mentoring. I think the things I'm most proud of are the mentoring aspects of everything I've done.
Here at the data science team at The New York Times, I'm building a group, and I assure you that I spend as much time thinking hard about the place and people as I do on things and ideas. Similarly, hackNY is all about mentoring. The whole point of hackNY is to create a network of very talented young people who believe in themselves and believe in each other and bring out the best in themselves and bring out the best in each other. And certainly at Columbia, the reason I'm still in academia is that I really value the teaching and mentoring and the quest to better yourself and better your community that you get from an in-person brick-and-mortar university as opposed to a MOOC.

Gutierrez: What does a typical day at work look like for you?

Wiggins: There are very few typical days right now, though I look forward to having one in the future. I try to make my days at The New York Times typical because this is a company. What I mean by that is that it is a place of interdependent people, and so people rely on you. So I try throughout the day to make sure I meet with everyone in my group in the morning, meet with everyone in my group in the afternoon, and meet with stakeholders who have either data issues or who I think have data issues but don't know it yet. Really, at this point, I would say that at none of my three jobs is there such a thing as a typical day.

Gutierrez: Where do you get ideas for things to study or analyze?

Wiggins: Over the past 20 years, I would say the main driver of my ideas has been seeing people doing it "wrong." That is, I see people I respect working on problems that I think are important, and I think they're not answering those questions the right way.
This is particularly true in my early career in machine learning applied to biology, where I was looking at papers written by statistical physicists whom I respected greatly, but I didn't think that they were using, or let's say stealing, the appropriate tools for answering the questions they had. And to me, in the same way that Einstein stole Riemannian geometry from Riemann and showed that it was the right tool for differential geometry, there are many problems of interest to theoretical physicists where the right tools are coming from applied computational statistics, and so they should use those tools. So a lot of my ideas come from paying attention to communities that I value, and not being able to brush it off when I see people whom I respect who I think are not answering a question the right way.

Gutierrez: What specific tools or techniques do you use?

Wiggins: My group here at The New York Times uses only open source statistical software, so everything is either in R or Python, leaning heavily on scikit-learn and occasionally IPython notebooks. We rely heavily on Git as version control. I mostly tend to favor methods of supervised learning rather than unsupervised learning, because usually when I do an act of clustering, which is generically what one does as unsupervised learning, I never know if I've done it the best. I always worry that there is some other clustering that I could do, and I won't even know which of the two clusterings is the better.
But with supervised learning, I usually can start by asking: How predictive is this model that we've built? And once I understand how predictive it is, then I can start taking it apart and ask: How does it work? What does it learn? What are the features that it rendered important? That's completely true both at The New York Times and at Columbia. One of the driving themes of my work has been taking domain questions and asking: How can I reframe this as a prediction task?

Gutierrez: How do you think about whether you're solving the right problem?

Wiggins: The key is usually to just keep asking, "So what?" You've predicted something to this accuracy? So what? Okay, well, these features turned out to be important. So what? Well, this feature may be related to something that you could make a change to in your product decisions or your marketing decisions. So what? Well, then I could sit down with this person and we could suggest a different marketing mechanism. Now you've started to refine and think all the way through the value chain to the point at which it's going to become an insight or a paper or product: some sort of way that it's going to move the world.

I think that's also really important for working with junior people, because I want junior people always to be able to keep their eyes on the prize, and you can't do that if you don't have the prize in mind. I can remember when I was much younger, a postdoc, I went to see a great mathematician and I talked to him for maybe 20 minutes about a calculation I was working on, as well as all of the techniques that I was learning. He sat silently for about 10 minutes and then he finally said, "What are you trying to calculate? What is the goal of this mathematical manipulation you're doing?" He was right, meaning you need to be able to think through toward "So what?"
If you could calculate this, if you could compute this correlation function, or whatever else it is that you're trying to compute, how would that benefit anything? And that's a thought experiment or a chain of thinking that you can do in the shower or in the subway. It's not something that even requires you to boot up a computer. It's just something that you need to think through clearly before you ever pick up a pencil or touch a keyboard.

John Archibald Wheeler, the theoretical physicist, said you should never do a calculation until you know the answer. That's an important way of thinking about doing mathematics. Should I bother doing this mathematics? Well, I think I know what the answer's going to be. Let me go see if I can show that answer. If you're actually trying to do something in engineering, and you're trying to apply something, then it's worse than that, because you shouldn't bother doing a computation or collecting a data set or even pencil-and-paper work until you have some sense for "So what?" If you show that this correlation function scales as T^(7/8), so what? If you show that you can predict something to 80-percent accuracy on held-out data, so what? You need to think through how it will impact something that you value.
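The supervised-learning workflow described in the answers above (first ask how predictive the model is on held-out data, then take it apart to see which feature it relied on) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature names, the toy data, and the choice of a one-rule "decision stump" are invented and are not the methods or data used at The New York Times.

```python
# Hypothetical feature names; the data below are generated, not real.
FEATURES = ["visits_per_week", "articles_saved"]

def make_rows():
    """Build 40 toy rows of ((visits, saved), label)."""
    rows = []
    for i in range(40):
        visits = i % 10                    # cycles through 0..9
        saved = (i * 7) % 5                # cycles through 0..4, unrelated to the label
        label = 1 if visits >= 5 else 0    # the "truth" the model must recover
        rows.append(((visits, saved), label))
    return rows

def accuracy(rows, feature, threshold):
    """Fraction of rows where the rule x[feature] >= threshold matches the label."""
    hits = sum(1 for x, y in rows if (1 if x[feature] >= threshold else 0) == y)
    return hits / len(rows)

def fit_stump(train):
    """Pick the (feature, threshold) pair with the best training accuracy."""
    best = (0, 0, -1.0)  # (feature index, threshold, training accuracy)
    for j in range(len(train[0][0])):
        for t in sorted({x[j] for x, _ in train}):
            acc = accuracy(train, j, t)
            if acc > best[2]:
                best = (j, t, acc)
    return best

rows = make_rows()
train, held_out = rows[0::2], rows[1::2]   # even rows train, odd rows are held out

# Step 1: how predictive is the model on data it has not seen?
feature, threshold, train_acc = fit_stump(train)
held_out_acc = accuracy(held_out, feature, threshold)

# Step 2: take the model apart. Which feature did it render important?
print(f"rule: {FEATURES[feature]} >= {threshold}")
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

In this toy setup the rule fits the training rows perfectly but scores lower on the held-out rows, because the learned threshold falls between values the training rows happened to contain. The number alone still leaves the "So what?" question open; the sketch only shows where that question starts.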
Gutierrez: What's an interesting project that you've worked on?

Wiggins: One example comes from 2001 when I was talking to a mathematician whom I respect very much about what he saw as the future of our field, the intersection of statistics and biology, and he said, "Networks. It's all going to be networks." I said, "What are you talking about? Dynamical systems on networks?" He said, "Sure, that and statistics of networks. Everything on networks."

At the time, the phrase "statistics of networks" didn't even parse for me. I couldn't even understand what he was saying. He was right. I saw him again at a conference on networks two years later. Many people that I really respected spoke at that conference about their theories of the way real-world networks came to evolve.

I remember stepping off the street corner one day while talking to another biophysicist, somebody who was coming from the same intellectual tradition that I had with my PhD. And I was saying, "People look at real-world networks, and they plot this one statistical attribute, and then they make up different models all of which can reproduce this one statistical attribute. And they're basically just looking at a handful of predefined statistics and saying, 'Well, I can reproduce that statistical behavior.' That attribute is over-universal. There are too many theories and therefore too many theorists saying that they could make models that looked like real-world graphs. You know what we should do? We should totally flip this problem on its head and build a machine learning algorithm that, presented with a new network, can tell which of a few competing theorists wins. And if that works, then we're allowed to look at a real-world network and see which theorist has the best model for some network that they're all claiming to describe."
That notion of an algorithm for model testing led to a series of papers that I think were genuinely orthogonal to what anybody else was doing. And I think it was a good example of seeing people whom I respect and think are very smart people but who were not using the right tool for the right job, and then trying to reframe a question being asked by a community of smart people as a prediction problem. The great thing about predictions is that you can be wrong, which I think is hugely important. I can't sleep at night if I'm involved in a scientific field where you can't be wrong. And that's the great thing about predictions: It could turn out that you can build a predictive model that actually is just complete crap at making predictions, and you've learned something.
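The idea of reframing "which theory of network growth is right?" as a prediction problem can be sketched in a few lines. This is a deliberately simplified illustration, not the method of the papers Wiggins refers to: the two competing "theories" (independent random links versus preferential attachment), the single `hub_score` feature, and the one-threshold classifier are all assumptions chosen to keep the example short.

```python
import random

def erdos_renyi_degrees(n, p, rng):
    """Theory A: every pair of nodes is linked independently with probability p."""
    degree = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                degree[u] += 1
                degree[v] += 1
    return degree

def preferential_degrees(n, m, rng):
    """Theory B: each new node links to m distinct older nodes, favoring high degree."""
    degree = [0] * n
    endpoints = []  # each edge adds both ends, so a uniform draw is degree-weighted
    for u in range(m + 1):              # seed: a small fully connected core
        for v in range(u + 1, m + 1):
            degree[u] += 1
            degree[v] += 1
            endpoints += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(endpoints[int(rng.random() * len(endpoints))])
        for old in targets:
            degree[new] += 1
            degree[old] += 1
            endpoints += [new, old]
    return degree

def hub_score(degree):
    """One summary feature: largest degree relative to the average degree."""
    return max(degree) / (sum(degree) / len(degree))

N = 300
rng = random.Random(0)
THEORIES = {
    "random": lambda: erdos_renyi_degrees(N, 4 / (N - 1), rng),
    "preferential": lambda: preferential_degrees(N, 2, rng),
}

# Train on simulated networks whose generating theory is known.
train = {name: [hub_score(make()) for _ in range(10)] for name, make in THEORIES.items()}
threshold = (max(train["random"]) + min(train["preferential"])) / 2

def classify(degree):
    """Predict which theory produced a network, given its degree sequence."""
    return "preferential" if hub_score(degree) > threshold else "random"

# Because the classifier makes predictions, it can be wrong: test it on fresh networks.
held_out = [(name, make()) for name, make in THEORIES.items() for _ in range(10)]
correct = sum(1 for name, degree in held_out if classify(degree) == name)
held_out_accuracy = correct / len(held_out)
print(f"threshold = {threshold:.2f}, held-out accuracy = {held_out_accuracy:.2f}")
```

Only if a classifier like this separates simulated networks reliably would it make sense to hand it a real-world network and ask which theory it resembles more. If held-out accuracy were near chance, that too would be informative: the chosen feature cannot distinguish the theories.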
Gutierrez: How have you been able to join that point of view with working at a newspaper?

Wiggins: It's actually completely the same. Here we have things that we're interested in, such as what sorts of behaviors engender a loyal relationship with our subscribers and what sorts of behaviors our subscribers evidence that tend to indicate they're likely to leave us and are not having a fulfilling relationship with The New York Times.

The thing about subscribers online is that there are really an unbounded number of attributes you can attempt to compute. And by "compute," I really mean that in the big data sense. You have abundant logs of interactions on the web or with products. Reducing those big data to a small set of features is a very creative and domain-specific act of computational social science.

You have to think through what it is that we think might be a relevant behavior. What are the behaviors that count? And then what are the data we have? What are the things that can be counted? And, of course, it's always worth remembering Einstein's advice that not everything that can be counted counts, and not everything that counts can be counted.

So you have to think very creatively about what's technically possible and what's important in terms of the domain to reduce the big data in the form of logs of events to something as small as a data table, where you can start thinking of it as a machine learning problem. There's a column I wish to predict: Who's going to stick around and who's going to leave us? There are many, many attributes: all of the things that computational social science, my own creativity, and very careful conversations with experts in the community tell me might be of interest. And then I try to ask: Can I really predict the thing that I value from the things that the experts believe to be sacred?
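The reduction described here, from logs of events to a small data table with one column to predict, might look roughly like the following sketch. The event types, the subscriber identifiers, the three counted features, and the `cancelled` label are all hypothetical; the real choice of features is exactly the creative, domain-specific step the passage describes.

```python
from collections import defaultdict

# Hypothetical event log: (subscriber_id, day, event_type). Invented for illustration.
EVENTS = [
    ("a", 1, "article_view"), ("a", 2, "article_view"), ("a", 2, "app_open"),
    ("a", 9, "article_view"),
    ("b", 1, "article_view"),
    ("c", 3, "app_open"), ("c", 4, "article_view"), ("c", 5, "article_view"),
]

# Hypothetical outcome, known only after the fact: did the subscriber cancel?
CANCELLED = {"a": False, "b": True, "c": False}

def build_table(events, labels):
    """Collapse raw events into one row of counted features per subscriber."""
    views = defaultdict(int)
    opens = defaultdict(int)
    days = defaultdict(set)
    for subscriber, day, kind in events:
        if kind == "article_view":
            views[subscriber] += 1
        elif kind == "app_open":
            opens[subscriber] += 1
        days[subscriber].add(day)
    table = []
    for subscriber in sorted(labels):
        table.append({
            "subscriber": subscriber,
            "article_views": views[subscriber],    # a behavior that can be counted
            "app_opens": opens[subscriber],
            "active_days": len(days[subscriber]),
            "cancelled": labels[subscriber],       # the column to predict
        })
    return table

table = build_table(EVENTS, CANCELLED)
for row in table:
    print(row)
```

Each row is now an ordinary supervised-learning example: the counted behaviors are the attributes, and `cancelled` is the column a model would try to predict.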
And sometimes those attributes could be a hundred things and sometimes that could be hundreds of thousands of things, like every possible sequence element you could generate from seven letters in a four-letter alphabet. Those are the particular things that you could look at. That is very much the same here as it is in biology.

You wish to build models that are both predictive and interpretable. What I tell my students at Columbia is that as applied mathematicians, what we do is we use mathematics as a tool for thinking clearly about the world. We do that through models. The two attributes of a model that make a model good are that it is predictive and interpretable, and different styles of modeling strike different balances between predictive power and interpretability.

A few Decembers ago, I had a coffee with a deep learning expert, and we were talking about interpretability, and he said, "I am anti-interpretability. I think it's a distraction. If you're really interested in predictive power, then just focus on predictive power." I understand this point of view. However, if you're interested in helping a biologist, or helping a businessperson, or helping a product person, or helping a journalist, then they're not going to be so interested in .08
error on held-out data. They're going to be interested in the insights and identification of the interesting covariates, or the interesting interactions among the covariates revealed to you.

I come from a tradition in physics that has a long relationship with predictive interpretability. We strive to build models that are as simple as possible but not simpler, and the real breakthroughs, the real news-generating events, in the history of physics have been when people made predictions that were borne out by experiment. Those were times that people felt they really understood a problem.

Gutierrez: Whose work is currently inspiring you?

Wiggins: It's always my students. For example, I have a former student, Jake Hofman, who's working with Duncan Watts at Microsoft Research. Jake was really one of the first people to point out to me how social science was birthing this new field of computational social science, where social science was being done at scale. So that's an example of a student who has introduced me to all these new things.

I would also say that all of the kids who go through hackNY are constantly introducing me to things that I've never heard of and explaining things to me from the world that I just don't understand. We had a hackNY reunion two Friday nights ago in San Francisco. I was out there to give a talk. We organized a reunion, and the Yo app had just launched. So a lot of the evening was me asking the kids to explain Yo to me, which meant explaining the security flaws in their API and not just how the app worked.

So that's the benefit of working with great students. Students are constantly telling you the future of technology, data science, and media, amongst other things, if you just listen to them. Former students and postdocs of mine have gone on to work at BuzzFeed, betaworks, Bitly, and all these other companies that are at the intersection of data and media.
I have also benefited greatly from really good colleagues whom I find inspiring. The way I ended up here at The New York Times, for example, was that, when I finally took a sabbatical, I asked all my faculty colleagues what they did with their sabbaticals, because I had never taken one. My friend and colleague Mark Hansen did the Moveable Type lobby art here in the New York Times Building. So if you go look at the art in the lobby, Mark Hansen wrote the Python to make the lobby art go, and he did that in 2007 when they moved into this building. So he knew many people at The New York Times, and he introduced me to a lot of people here and was somebody who explained to me, though he didn't use these words, that The New York Times is now in a similar state to the state that biology was in [...]. That is, it's a place where they have abundant data, and it's still up for grabs what the right way is to use machine learning to make sense of those data.