+ BIG DATA ANALYTICS Vishy Venugopalan
+ AGENDA
• Introduction: The Age of Big Data
• The Analytics Adoption Curve
• The New Data Stack
• Opportunities in the Big Data Analytics Market
• Investment Candidates
+ WE LIVE IN THE AGE OF BIG DATA
• IDC: Worldwide Big Data market (excluding infrastructure and storage) projected to be a $16.5bn market in 2015, growing at a 40% CAGR between 2010 and 2015.
• McKinsey: 15 out of 17 sectors in the US economy have more data stored per company than the US Library of Congress.
Source: The Economist, "Data, Data Everywhere", Feb 2010
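As a rough sanity check on the IDC figure above, the sketch below backs out the 2010 market size implied by $16.5bn in 2015 at a 40% CAGR. It assumes five full compounding years (2010 to 2015) and a constant growth rate. The slide states neither explicitly, and the function name is illustrative.

```python
def implied_base(final_value: float, cagr: float, years: int) -> float:
    """Starting value implied by a final value and a constant annual growth rate."""
    return final_value / (1.0 + cagr) ** years


# Assumption: "2010-15" means five compounding periods.
base_2010_bn = implied_base(final_value=16.5, cagr=0.40, years=5)
print(f"Implied 2010 market size: ${base_2010_bn:.2f}bn")
```

Under these assumptions the implied 2010 base is roughly $3bn, meaning the market would grow more than fivefold over the period.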
+ THE TRADITIONAL DATA STACK
Applications: Transactions and analytics (OLTP/OLAP); Business intelligence; Data management
Infrastructure: Row storage; Columnar storage
Hardware: Disk; Solid-state devices
+ THE ANALYTICS ADOPTION CURVE
Stage: EARLY | GROWTH | MATURE
Who drives data analyses? Engineers | Technically oriented business analysts | All business analysts
Type of analyses conducted: Custom-built, high-touch | Simple, self-service | Complex, self-service, ad hoc
Analysis tools used: Programming languages | Query languages | Visual, drag-and-drop tools
+ THE TRADITIONAL DATA STACK IS FACING CHALLENGES
• Not built for petabyte scale, semi-structured data, or real-time data
• Relational databases are being complemented by NoSQL databases and alternative storage technologies
• Hadoop: the open-source community and commercial innovation are building a parallel data stack that overcomes these limitations
• Pioneered at Internet-scale companies (Google, Yahoo, Amazon)
+ THE NEW DATA STACK
Applications
Infrastructure: Query + analytics (Hive, Pig); DB management (ZooKeeper); Distributed file system (HDFS); Full-text search capabilities (Solr); NoSQL / alt. storage
Hardware: Distributed storage; Solid-state devices
+ OPPORTUNITIES IN THE BIG DATA ANALYTICS MARKET
Stage: EARLY | GROWTH | MATURE
Who drives data analyses? Engineers | Technically oriented business analysts | All business analysts
Toughest challenges: Workflow and coordination | Analysis tools for standalone data | Integrating disparate data sources
Startups to watch:
• Short term: startups offering platforms that address the workflow, coordination, and handoff problems
• Medium to long term: startups that provide effective tools for self-service analyses and integration with the traditional data stack
+ INVESTMENT CANDIDATES
Seed/Bootstrapped | Seeking Series A | Post-Series A
+ THE BIG PLAYERS ARE UNSURE OF THE WAY FORWARD
• The data and analytics stack is undergoing a generational shift. Big Data represents a new kind of product (petabyte-scale) running on a new kind of infrastructure (cloud-scale).
• For now, the major data players (IBM, Oracle, Microsoft) are making partnerships and formulating strategies for the world of Big Data.
• From a product perspective, the changes are akin to the platform shift from mainframe to PC or, more recently, the systems management shift from physical servers to virtual servers.
+ CONCLUSION
• We are in the age of Big Data, where the amount of data generated by businesses and consumers is unprecedented.
• The mainstream data stack today, particularly the Business Intelligence subsegment, is built for the datasets of the 1990s and is ripe for change.
• Internet-scale companies were the first to notice this problem. Their efforts seeded a new data stack.
• In the short term, startups that solve the workflow and coordination problems are attractive investment candidates; in the longer term, tools and data integration will produce winners.
+ APPENDIX Detailed individual summaries of companies
+ MORTAR DATA (Boston, MA)
VALUE PROPOSITION: Easy browser-based environment to run Hadoop jobs on data that lives in the cloud.
USER & CUSTOMER PROFILE: Business analyst at an SMB.
USE CASES: Finding patterns in clickstream data, log data. Requires Hadoop jobs written in a consumable manner by developers.
COMPETITION: Amazon Elastic MapReduce at the basic level; Hapyrus; StackIQ.
STATUS: Founded Aug 2011. Seed stage. Raising $450K ($110 committed). Just started TechStars Boston.
TEAM: 3 employees. All technical. Met at university. Worked at Wireless Generation together.
+ DEMYST.DATA (New York, NY)
VALUE PROPOSITION: A way to predict consumer credit risk profiles from unstructured data available all over the Web.
USER & CUSTOMER PROFILE: Credit underwriter at a prepaid card issuer, check casher, or payday loan provider.
USE CASES: An alternative to FICO scores: the company's algorithm picks 2-3 attributes of an individual's online presence relevant to their credit risk.
COMPETITION: TransUnion, Equifax etc.; limited functional overlap (but no customer overlap) with Palantir.
STATUS: Friends-and-family (F&F) funded. Series A raise in Q4 2012.
TEAM: Two Columbia MBA grads. One of them is ex-LexisNexis.
+ HADAPT (Cambridge, MA)
VALUE PROPOSITION: An analytic database that enables SQL queries against Hadoop data.
USER & CUSTOMER PROFILE: Anyone who uses a data warehouse.
USE CASES: Finding patterns in unstructured data that lives in a database (e.g. BLOBs); eventually, integrating unstructured and structured data in one warehouse.
COMPETITION: Apache Hive; Vertica (only for structured data queries).
STATUS: Late beta. 10/2011: $9.5m Series A (Norwest, BVP); Series B in early 2013.
TEAM: CTO worked at MIT on C-Store, which later became Vertica. Now a Yale professor. Management comes from Endeca, Aster Data etc.
+ MAPR TECHNOLOGIES (San Jose, CA)
VALUE PROPOSITION: Enterprise-class Hadoop distribution with proprietary extensions.
USER & CUSTOMER PROFILE: Data scientists and software engineers at large and small organizations.
USE CASES: Processing semi-structured data using Hadoop. Integrates easily with existing enterprise storage (NAS clusters etc.). Allows stream-based processing.
COMPETITION: Cloudera, Hortonworks, Apache Hadoop.
STATUS: 8/2011: Series B $20m (Redpoint, Lightspeed, NEA).
TEAM: CTO headed up Google BigTable group. Previously founded a fast clustered NAS startup.
+ ZETTASET (Mountain View, CA)
VALUE PROPOSITION: Enterprise-class Hadoop management and deployment tools.
USER & CUSTOMER PROFILE: Data scientists and software engineers at large and small organizations.
USE CASES: Processing semi-structured data using Hadoop. Has a particular risk management and information governance focus.
COMPETITION: MapR, Cloudera, Hortonworks, Apache Hadoop.
STATUS: 4/2011: Series A $3m (DFJ, Epic Ventures).
TEAM: 15 employees (12 technical). Founder previously founded SPI Dynamics, a web application security software company (acquired by HP).
+ HSTREAMING (Chicago, IL)
VALUE PROPOSITION: Stream-based processing for Hadoop, similar to Complex Event Processing.
USER & CUSTOMER PROFILE: Data scientists and software engineers at Fortune 500 organizations. Over 20 customers at this time.
USE CASES: Processing semi-structured data using Hadoop. Has a particular risk management and information governance focus.
COMPETITION: IBM InfoSphere, Microsoft StreamInsight, StreamBase, S4, Storm.
STATUS: Self-funded so far. Raising Series A.
TEAM: 3 employees (2 technical). One of the founders worked on a similar product at IBM.
+ RADOOP (Budapest, Hungary)
VALUE PROPOSITION: Graphical data mining interface (à la RapidMiner) for Hadoop data stores.
USER & CUSTOMER PROFILE: Code-free product, used by business analysts at large and small Hadoop shops.
USE CASES: Requires a Hadoop cluster at the moment. However, the GA product will provide them log reduction and analytics tools without exposing Hadoop.
COMPETITION: Datameer, Karmasphere, Splunk, RapidMiner.
STATUS: Private beta. Self-funded. 1000 beta users.
TEAM: 6 engineers (all technical). Recent PhD candidates in computer science from Hungary.
+ HAPYRUS (Palo Alto, CA)
VALUE PROPOSITION: Browser-based solution that enables more effective collaboration and workflow between engineers and data analysts analyzing large datasets.
USER & CUSTOMER PROFILE: Companies with rapidly growing data that already lives in the cloud. Ideally uses S3 and Elastic MapReduce.
USE CASES: Engineers can write templated Hadoop jobs in which business analysts can change parameters and perform ...
COMPETITION: Datameer, Apache Hive.
STATUS: $700K from 500 Startups and Japanese angel investors. Next round in 2013.
TEAM: 3 employees (2 technical).
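The "templated job" idea in the Hapyrus use case can be pictured with a minimal sketch: an engineer writes a query once, and an analyst changes only a whitelisted set of parameters. This is not Hapyrus's actual implementation. The Hive-style query, the `clickstream` table, the column names, and the `render_job` helper are all hypothetical, and a real system would also need to validate parameter values.

```python
from string import Template

# Hypothetical job template written once by an engineer.
# Table and column names are illustrative only.
JOB_TEMPLATE = Template(
    "SELECT page, COUNT(*) AS views "
    "FROM clickstream "
    "WHERE event_date >= '$start_date' "
    "GROUP BY page "
    "ORDER BY views DESC "
    "LIMIT $top_n"
)

# The only knobs an analyst is allowed to change, with engineer-chosen defaults.
DEFAULTS = {"start_date": "2012-01-01", "top_n": "10"}


def render_job(**overrides):
    """Fill the template, rejecting any parameter the engineer did not expose."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **{key: str(value) for key, value in overrides.items()}}
    return JOB_TEMPLATE.substitute(params)


# An analyst changes one parameter without touching the query logic.
print(render_job(top_n=25))
```

The point of the whitelist is the handoff: the engineer owns the job logic, while the analyst can rerun it with different inputs without writing code.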
+ COGNIER (Santa Clara, CA)
VALUE PROPOSITION: Solution for analyzing timestamped, semi-structured data.
USER & CUSTOMER PROFILE: Business analysts with limited technical backgrounds, who want to graphically visualize analyses.
USE CASES: Analyze unusual variations in the data, particularly over time. E-commerce, SaaS and mobile app customers are most common.
COMPETITION: Web analytics (Google Analytics, TeaLeaf); BI companies (Cognos, Business Objects); Splunk.
STATUS: Bootstrapped. 3 months from GA. Looking for Series A in late 2012.
TEAM: 3 employees, ex-Stratify (e-discovery startup acquired by Autonomy).
+ KAGGLE (San Francisco, CA)
VALUE PROPOSITION: Statistical outsourcing platform for modeling and prediction competitions. Turns data science into a sport.
USER & CUSTOMER PROFILE: Organizations of all sizes with limited resources (talent or infrastructure) to analyze large datasets internally.
USE CASES: Anyone can post a competition on Kaggle with a well-defined objective and a prize for the IP behind the solution.
COMPETITION: (indirect) CrowdFlower, InnoCentive, TekScout.
STATUS: Nov 2011: $11m Series A by Index Ventures and Khosla Ventures. Max Levchin and Hal Varian are also investors.
TEAM: Under 10 employees. Founded by Australian data scientists.
+ TRESATA (Charlotte, NC)
VALUE PROPOSITION: Bringing the power of Hadoop to financial industry data (structured and unstructured). Available on-premise or in the cloud.
USER & CUSTOMER PROFILE: Retail and institutional financial services customers.
USE CASES: Massively parallel analytics correlating own data and public data: financial and non-financial (e.g. social). Note: certain data can be provided by Tresata's own partners.
COMPETITION: Datameer, Palantir.
STATUS: $1.5m in seed and angel financing.
TEAM: <10 employees. Founders are ex-Bank of America.
+ PLATFORA (San Mateo, CA)
VALUE PROPOSITION: Platform offering interactive business intelligence reports that are translated on the fly into scalable, parallel Hadoop jobs. Visualization in the form of dashboards and reports.
USER & CUSTOMER PROFILE: Business-facing data analysts at companies with large datasets: Internet/e-commerce, telecom, logistics, finance.
USE CASES: Any currently fulfilled by traditional data warehouses, BI and ETL tools.
COMPETITION: Datameer, Apache Hive.
STATUS: Series A $7.2m by Andreessen Horowitz, Sutter Hill Ventures, In-Q-Tel.
TEAM: 10 employees. Founder/CEO is ex-Greenplum.