TECH NOTE Hadoop Alone Is Not Big Data
Twenty-one years ago, a year before the first web browser appeared, Walmart's Teradata data warehouse exceeded a terabyte of data and kicked off a revolution in supply-chain analytics. Today Hadoop is doing the same for demand-chain analytics. The question is, will we just add more zeros to our storage capacity this time, or will we learn from our data warehouse infrastructure mistakes? These mistakes include:

- data silos
- organizational silos
- confusing velocity with response time

DATA SILOS

A data silo is a system that has lots of inputs but few outputs. The Wikipedia page for "data warehouse" shows an architecture diagram with operational systems on the left, data marts on the right, and a data vault in the middle, but the third definition of "vault" at Merriam-Webster.com is "a burial chamber." All too often, enterprise data warehouses have become data burial chambers, or perhaps data hospice facilities: places where data goes to die.

To prevent this from happening to Hadoop systems, we need more techniques to get data out of the central data store to people and other systems. A few data marts just aren't sufficient anymore for connecting with development partners, ad tech vendors, and the myriad customer touchpoints available to retailers and brands. Data export
techniques should cover a variety of performance characteristics so that the best technique can be used for each use case. Such techniques include:

- good ol' batch FTP of flat files, XML files, and compact binary file formats such as Avro
- publish-subscribe messaging interfaces, AKA enterprise message buses, such as Kafka
- real-time REST APIs built on high-speed databases such as HBase and Voldemort
- OLAP and data visualization user interfaces for business analysts who aren't data scientists, such as Pentaho, Tableau, and Simba for Excel

Let's consider the last two in more detail. First, "real-time" means different things to different people. Fifty milliseconds (1/20th of a second) is real-time for stock trading. Google found that an increase of 500 milliseconds (1/2 a second) in page load time decreases traffic 20%, and Amazon found that even a 100 millisecond (1/10th of a second) increase in load time significantly decreases retail website revenue. [1] One-tenth of a second response time is a high bar for APIs to meet. To achieve it at the 95th percentile, retailers need multiple data centers per market so that shoppers always use a data center that is close by, thereby minimizing response times. In short, they need multiple front-end data centers for each Hadoop back-end data center.

Second, OLAP and data visualization are part of an exciting industry trend toward the democratization of data, where the goal is to enable people to access the data they need themselves rather than routing queries through some central analytics department. Nike FuelBand, Fitbit, and 23andMe are examples of this trend in consumer products, and OLAP and data visualization are the enabling technologies for business users. Democratization of data holds the promise of preventing another big data warehouse mistake from the past: organizational silos.
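Returning to the response-time target above: a 100 millisecond budget "at the 95th percentile" can be checked with a few lines of Python. This is only a sketch under stated assumptions. The function names, the sample latencies, and the 60 ms penalty for a distant data center are all hypothetical and are not taken from any measurement in this note. It uses the nearest-rank definition of a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_budget(samples_ms, budget_ms=100, pct=95):
    """True if the pct-th percentile latency is within the budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Hypothetical API response times in milliseconds (illustrative only).
nearby = [40, 45, 50, 52, 55, 58, 60, 62, 65, 70,
          72, 75, 78, 80, 82, 85, 88, 90, 95, 180]

# Assumption: a distant data center adds 60 ms of network round trip.
distant = [latency + 60 for latency in nearby]

print("nearby p95:", percentile(nearby, 95), "ms, ok:", meets_budget(nearby))
print("distant p95:", percentile(distant, 95), "ms, ok:", meets_budget(distant))
```

With these made-up numbers the nearby data center stays inside the budget and the distant one does not, even though the server-side work is identical. That is the reasoning behind serving shoppers from a front-end data center close to them.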
[Figure: Back-end Data Center and Front-end Data Center]

[1] John Rauser (Amazon), The impact of website performance on conversion, June 8, 2004; Greg Linden (Amazon), Make data useful, http://www.scribd.com/doc/4970486/make-data-useful-by-greg-linden-amazoncom. See also Eric Schurman (Microsoft) and Jake Brutlag (Google), Performance related changes and their user impact, O'Reilly Velocity, Web Performance and Operations Conference (Velocity 09), June 2009; Philip Dixon (Shopzilla), Shopzilla's site redo - you get what you measure, O'Reilly Velocity, Web Performance and Operations Conference (Velocity 09), June 2009.
ORGANIZATIONAL SILOS

An organizational silo, like a data silo, has lots of inputs but few outputs: it's a people bottleneck. Too often, if business analysts wanted data, they had to go to a central analytics team, wait in line, get the analytics team to understand their need, wait a few days for the results, realize that the results weren't what they thought they'd asked for, and repeat the process until one side gave up. Then, when business analysts complained and asked why on earth it could take so long, analytics just said, "There's a lot of math involved. You wouldn't understand."

Over the past 20 years, that situation has created a kind of analytics aristocracy that's not very useful. If large companies can create such organizational silos with SQL, BI, and SAS, just imagine the kind of silos they'll be able to create with the new technologies: Hadoop, MapReduce, and R. Data democratization is the cure for organizational silos.
VELOCITY VS. RESPONSE TIME

The last data warehouse mistake we can avoid with Hadoop systems is confusing velocity with response time. Consider an analogy. Suppose you're shipping a package from Los Angeles to San Francisco, but because of your shipper's infrastructure, it goes through Memphis. If it takes 12 hours from LA to Memphis (1,800 miles) and 12 hours from Memphis to San Francisco (2,000 miles), that's 3,800 miles in 24 hours, or 158 miles per hour. Pretty fast. However, if you cut out Memphis and go directly from LA to San Francisco (380 miles) in 12 hours, then that's just 32 miles per hour: pretty slow. Yet the slower route gets the package delivered 12 hours earlier. The point is that velocity should be measured from the customer's point of view, not the infrastructure's, since infrastructure only exists to serve the customer.

The following diagram shows what used to be a typical data flow from a customer, through a data warehouse, and then back to the customer, where each of the eight steps was scheduled and run in batch. Even if each link is fast, the whole round trip is rather slow.

[Figure: Customer, Operational System, Operational Data Store, ETL, Staging Area, Data Warehouse, ETL, Data Mart]

With cloud-based Hadoop systems we can simplify this and greatly reduce response time. Data is pushed directly from Hadoop to front-ends for use by real-time APIs, and to data marts for use by business analysts. Rather than updating customer attributes daily, weekly, or quarterly, this architecture enables real-time updates, click by click.

[Figure: Hadoop, FTP, Message Bus, Front-ends, Data Mart]

Hadoop holds immense promise for adding many more zeros to our storage and analytics capacity, and for transforming companies to be more data driven. However, to reach its full potential we should avoid the mistakes of the past. Otherwise, we're in for another twenty years of silos, aristocracies, and inadequate response times, or as aristocrats sometimes say, "different tree, same monkeys."
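The arithmetic in the shipping analogy above can be reproduced in a few lines of Python. This is only a sketch: the function and variable names are illustrative, and the distances and durations are the ones quoted in the analogy.

```python
def average_speed_mph(miles, hours):
    """Average speed over a route: the infrastructure's view of 'velocity'."""
    return miles / hours


# Route via the shipper's hub: LA -> Memphis -> San Francisco.
hub_miles = 1800 + 2000
hub_hours = 12 + 12

# Direct route: LA -> San Francisco.
direct_miles = 380
direct_hours = 12

hub_speed = average_speed_mph(hub_miles, hub_hours)
direct_speed = average_speed_mph(direct_miles, direct_hours)

# The customer's view is elapsed time, not miles per hour.
hours_saved = hub_hours - direct_hours

print(f"via hub: {hub_speed:.0f} mph, delivered in {hub_hours} hours")
print(f"direct:  {direct_speed:.0f} mph, delivered in {direct_hours} hours")
print(f"the 'slower' direct route arrives {hours_saved} hours earlier")
```

The hub route is roughly five times faster in miles per hour, yet the direct route wins on the only number the customer experiences: time to delivery.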
ABOUT RICHRELEVANCE

RichRelevance is the global leader in omni-channel personalization. More than 160 international companies use RichRelevance to turn data into actionable insight, which delivers the most relevant experience for consumers as they shop across web, store, and mobile. RichRelevance drives more than one billion decisions every day, and has delivered over $10 billion in attributable sales to its clients, which include Target, Marks & Spencer, and Priceminister. Recently, the company opened its cloud-based platform to allow clients to easily merge disparate data sources and build real-time applications tailored to their specific business needs. RichRelevance is headquartered in San Francisco and serves clients in 40 countries from 9 offices around the globe. For more information, please visit www.richrelevance.com.

© 2014 RichRelevance, Inc.