2013 ATC Fall Conference
Success Story: Big Data Drives Profits
Brett Farrar, Founding Partner, Sendero Business Services
Success Stories: Big Data Drives Profits
October 15, 2013
Information Timeline, Part I

~50,000-100,000 BC  Spoken language develops
~4000 BC  Written word is developed
1440  Johannes Gutenberg invents the printing press
1775  US Postal Service begins
1888  Richard Sears first uses a printed mailer
1936  First freely programmable computer
1961  First database invented (IDS)
1970  Relational database first defined
1981  IBM PC introduced
1984  Apple Macintosh introduced
1985  Microsoft Windows introduced
1988  IBM article coins the term "business data warehouse"
1989  World Wide Web first proposed
1994  Introduction of cookies allows for internet tracking
1997  Term "Big Data" used by NASA researchers
2005  Hadoop created
2011  IBM's Watson computer uses Big Data to beat human competitors on Jeopardy
2012  2.5 exabytes of data are created each day; this number doubles every 40 months
Sears, Roebuck & Co. Story: Sears Becomes a Brand Powerhouse

In 1888, Richard Sears first used a printed mailer (i.e., a catalog) to market and sell the products offered in Sears stores. Sears used the catalog to introduce itself to Americans and became a phenomenally successful company. Americans looked forward to getting their annual Sears catalog.

Key Enablers
o Written word
o Printing press
o Ubiquitous and cheap mail service

Observations
o Organizational commitment and insights came from the top down
o The key enablers were in place for over 100 years before a visionary took full advantage
o The catalog was risky and a big investment
Information Timeline, Part II
(same timeline as Part I)
Walmart Story: Walmart Becomes the Low-Cost King

In the 1970s and 1980s, Sam Walton implemented radical cost-cutting by partnering with the IT-savvy executive Roy Mayer and his data processing protégé, Royce Chambers. Together, they overhauled the company's logistics and upgraded the computer system that tracked merchandise sales and orders. They also partnered with their suppliers to share information and further cut supply chain costs. Walmart is now the largest company in the world.

Key Enablers
o Computers / computing power
o Databases for OLTP
o Communications and partnering with suppliers

Observations
o Organizational commitment and insights came from the top down
o Walmart was storing, tracking, exchanging, and using big data before the term was ever coined
o Big investment in time and money before payoff
American Airlines Story: American Airlines Leads the Way in Advanced Analytics

Robert Crandall, the former CEO of American Airlines, is credited with developing the airline's frequent flyer program and Sabre reservation system, and with pioneering its Operations Research and Advanced Analytics group. Consequently, AA has an organizational commitment to data capture and analysis. Their use cases for data analytics include parts allocation/location based on demand prediction, market size forecasting, supply chain optimization, plane configuration, etc.

Key Enablers
o Computers / computing power
o Centralized databases for OLTP and OLAP
o Data from third parties
o People dedicated to asking why and how

Observations
o Organizational commitment and insights came from the top down
o Rotable parts allocation resulted in savings of several million dollars per year for one fleet
o Big investment in time and money before payoff
Large Gaming Retailer Story: What Information Are You Missing?

This company needed to gain more insight into its customers, so it implemented a loyalty program to incentivize customers to share demographic information, contact information, preferences, etc. Before the program, the company thought its customers were male, twenty-something gamers. Afterwards, it realized its best customers were really middle-income families with children. This caused it to change its marketing, store locations, game mixtures, etc., and allowed it to increase its revenue and market share.

Key Enablers
o Computers / computing power
o Databases for OLTP and OLAP
o Asking what other data is needed

Observations
o Organizational commitment and insights came from the top down
o Big investment in time and money before payoff
Information Timeline, Part III
(same timeline as Part I)
Characteristics of Big Data

Big Data is usually characterized by some common attributes, although not all of these characteristics have to exist at once in a single dataset.

Volume
o Lots of data! Petabytes and exabytes

Velocity
o The speed at which data is collected
o New data constantly arriving at a high rate
o Machine sensor data, internet data, website customer behavior data, web logs, etc.

Variety
o Any type of data
o Structured data (e.g., database tables, CSV files), unstructured data (e.g., video files, audio files, blog entries, Twitter feeds), and semi-structured data (e.g., log files)
Implications of Big Data

The characteristics of Big Data force us to use new ways to store, manage, search, and retrieve data.

Traditional / centralized datastores (e.g., relational databases) are not optimized to handle the characteristics of Big Data
o Data is typically limited, minimized, and archived when using centralized datastores because performance suffers with large volumes of data
o Centralized datastores are made to handle structured data, where the format/structure is dictated to the datastore before the data is written into it; therefore, they do not handle unstructured or semi-structured data in a way that allows the data to be searched easily

Decentralized datastores (e.g., the Hadoop file system)
o Developed to handle the characteristics of Big Data
o Hadoop has become the most popular decentralized datastore
o Hadoop is an open-source project that was developed at Yahoo and is still used extensively there to deliver information to customers
How Big is Big Data?

The Library of Congress (L.O.C.):
o 152 million physical items
o 838 miles of bookshelves
o 10 terabytes of printed data = 1 L.O.C.
How Big is Big Data?

Twitter:
o 400 million tweets per day
o 4 terabytes per day
o 2.5 days per L.O.C.
How Big is Big Data?

o 1 million+ customer transactions per hour
o 2.5 petabytes of data in storage
o = 250 L.O.C.s
How Big is Big Data?

o 4.7 billion searches per day in 2011
o 20 petabytes of data processed per day in 2008
o = 2,000 L.O.C.s of data processed per day in 2008
How Big is Big Data?

o 42 zettabytes of words ever spoken by human beings
o = 4,200,000,000 L.O.C.s of words ever spoken by human beings
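Taking 1 L.O.C. = 10 terabytes (the printed-collection estimate above) and decimal units, the conversions on the preceding slides reduce to simple unit arithmetic. A quick sketch (the labels here are just the figures quoted above; note that the last conversion works out to about 4.2 billion L.O.C.s):

```python
# Unit arithmetic for the L.O.C. comparisons above.
# Assumes decimal units: 1 TB = 10**12 bytes, and 1 L.O.C. = 10 TB of printed data.
TB = 10**12
PB = 10**15
ZB = 10**21
LOC = 10 * TB  # printed collection of the Library of Congress

# Twitter: 4 TB/day -> days to produce one L.O.C.
print(LOC / (4 * TB))    # 2.5

# Retailer: 2.5 PB in storage -> L.O.C. equivalents
print(2.5 * PB / LOC)    # 250.0

# Search engine: 20 PB processed per day -> L.O.C. equivalents per day
print(20 * PB / LOC)     # 2000.0

# All words ever spoken: 42 ZB -> L.O.C. equivalents (about 4.2 billion)
print(42 * ZB / LOC)
```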
Centralized vs. Decentralized Datastores

Centralized: compute and storage are separate (CPU & memory on one tier, storage on another)
Decentralized: compute and storage live together on each node (CPU, memory, and storage combined)
Centralized vs. Decentralized Datastores

Centralized: vertical scaling
Decentralized: horizontal scaling

Implications
o RDBMS scalability is limited by the internal capacity of a single server
o An RDBMS is a single point of failure, requiring high-end hardware and software
o Big Data scales by adding servers
o Big Data leverages commodity servers
Centralized vs. Decentralized Datastores

Implications
o Both prevent data loss due to failures
o Big Data can shift workload to under-utilized servers
o In an RDBMS, all queries go through a single database engine
Centralized vs. Decentralized Datastores

Centralized: data to the query — raw data is shipped from storage to the compute tier, which runs the query and returns results
Decentralized: query to the data — the query is sent to the nodes holding the data, and only results travel back

Implications
o An RDBMS sends large amounts of raw data across a network connection, slowing performance
o Big Data minimizes network traffic by sending the query to the data and compute node
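The "query to the data" idea above is the essence of the MapReduce pattern Hadoop uses: each node runs the same small function over its local block of data, and only the small per-node result crosses the network. A minimal sketch, with the data nodes simulated as local partitions (the names and data here are illustrative, not the Hadoop API):

```python
from collections import Counter
from functools import reduce

# Simulated data blocks, as they would sit on three separate data nodes.
blocks = [
    ["error", "ok", "ok"],
    ["ok", "error", "ok"],
    ["ok", "ok", "ok"],
]

def map_phase(block):
    # Runs locally on the node holding the block; only this small
    # Counter crosses the network, never the raw records.
    return Counter(block)

def reduce_phase(a, b):
    # Merges the per-node partial results into the final answer.
    return a + b

partials = [map_phase(b) for b in blocks]   # executed "at the data"
total = reduce(reduce_phase, partials)
print(total)   # Counter({'ok': 7, 'error': 2})
```

The contrast with the centralized column is that here nine raw records stay put and three tiny counters move, instead of all nine records crossing the network to a single query engine.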
Centralized vs. Decentralized Datastores

Implications
o RDBMS performance improvements diminish as more data is added; more and more money is spent trying to maintain performance
o Big Data performance scales almost linearly with investment and the amount of data
Centralized vs. Decentralized Datastores

Centralized: structure at store time — Extract, Transform (apply structure), Load
Decentralized: structure at read time — Extract, Load, then Read & Transform (apply structure)

Implications
o An RDBMS requires you to know how data will be used/classified at collection time
o Big Data allows new insights to be tested more quickly and easily because structure is not applied until read time
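The "structure at read time" column can be sketched as: raw records are loaded untouched, and each question applies its own parse when it reads. A toy illustration (the log format and field positions are made up for the example):

```python
# Schema-on-read: store raw, apply structure per question at read time.
raw_log = [
    "2013-10-15 GET /fares 200 124ms",
    "2013-10-15 GET /seats 500 980ms",
    "2013-10-16 POST /book 200 201ms",
]

# Question 1: server errors by day -> parse only date and status code.
def errors_by_day(lines):
    out = {}
    for line in lines:
        date, _method, _path, status, _latency = line.split()
        out[date] = out.get(date, 0) + status.startswith("5")
    return out

# Question 2: slowest endpoint -> a different structure over the same
# raw data, with no up-front ETL or schema change needed.
def slowest(lines):
    def latency(line):
        return int(line.split()[4].rstrip("ms"))
    return max(lines, key=latency).split()[2]

print(errors_by_day(raw_log))  # {'2013-10-15': 1, '2013-10-16': 0}
print(slowest(raw_log))        # /seats
```

In the centralized column, each new question like `slowest` would mean changing the schema and ETL before any data could be examined; here it is just another read-time function.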
Implications of a Decentralized Datastore

Because of the architecture provided by a decentralized datastore and the use of commodity servers, things are possible that were not possible previously.

Cost per terabyte goes down by a factor of 10+
o Economics now allow for saving and analyzing trending/historical data that was previously discarded
o Can now save and analyze external data, and data that was never saved before

Can store, process, search, and retrieve any kind of data (unstructured, structured, or semi-structured)
o Allows exploration of data that was unavailable or too difficult/costly to use before
o Allows mashups of unstructured or semi-structured data with structured data (e.g., data from transactional processing systems)

Fail fast
o Centralized databases require ETL scripts whose design, development, and testing demand significant IT time and cost; validating a hypothesis therefore took a long time and a lot of money
o With a decentralized datastore, hypotheses can be validated much more quickly, quite often without IT involvement
Travel Industry Success Story: Comparing Airline Flight Fares

A small web-based company (35 employees) in the travel industry provides comparisons of airline flight fares across carriers. It started an IT-driven initiative 18 months ago with a small Hadoop cluster (5 nodes) to analyze user behavior data on its website. This analysis helped improve its SEO results, increasing traffic by over 100%. It was so successful that the business partnered with IT to build a second, much larger Hadoop cluster (50 nodes) to combine (1) fare data, (2) flight schedule data, and (3) seat availability data. This has greatly improved the quality of the data delivered to customers, and improved customer satisfaction. The company has now reorganized to better align with its new data-focused strategy.

Key Enablers
o Hadoop
o Data from third parties
o People dedicated to asking why and how
o Partnering across the organization
o Fail fast concept

Observations
o Organizational commitment and insights came from the bottom up
o Small investment in time/money for the pilot; started small and cheap to get quick wins
o This is a small company (Big Data is not just for big companies)
ETL Modernization

Business Problem
o Extract-Transform-Load (ETL) jobs taking too long to run and exceeding the batch processing window
o ETL jobs not completing successfully and preventing business-critical data from being available where it is needed

Solution
o Hadoop solution implemented to replace some (or all) of the offending ETL jobs
o Nothing changes with the source systems (e.g., transactional processing systems) or destination systems (e.g., the Enterprise Data Warehouse)
o Data movement and transformation from source systems to destination systems speeds up tremendously (quite often by a factor of 2-3 or more)
o Easy use case that quickly achieves projected ROI and cost-justifies bigger and more aspirational use cases
o Great way to start doing Hadoop / Big Data and to build skills
Risk Mitigation

Business Problem
o Companies hold critical historical data in proprietary formats, or run critical proprietary applications, where the supporting vendor is no longer in business or the application is no longer supported
o These companies are at risk if they run into an issue with the data or application and cannot get the support required to resolve it
o The data for these applications is stored on high-end, expensive storage devices (e.g., SAN) to ensure its availability

Solution
o Hadoop solution implemented to store the data in a much more cost-effective way (can be as little as 1/10 to 1/3 the cost)
o Data can now be searched in new ways
o Easy use case that quickly achieves projected ROI and cost-justifies bigger and more aspirational use cases
o Great way to start doing Hadoop / Big Data and to build skills
Supply Chain and Logistics

Business Problem
o Manufacturers need just-in-time availability of components
o Stock-outs cause harmful production delays
o Sensors and RFID tags reduce the cost of capturing more supply chain data, which needs storage and processing

Solution
o Big Data architecture stores unstructured, streaming, dirty sensor data
o Manufacturers get lead time to make alternative arrangements for supply chain disruptions
o Prevent stock-outs, reduce supply chain costs, and improve margins for the finished product
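The stock-out problem above is commonly framed as a reorder point: with continuous RFID/sensor data, the demand rate feeding that calculation can be refreshed constantly instead of quarterly. A textbook-style sketch, with all quantities invented for illustration:

```python
import statistics

# Daily usage of one component, as reported by RFID reads on the line
# (illustrative numbers).
daily_usage = [42, 38, 45, 40, 51, 39, 44]
lead_time_days = 3   # supplier lead time (assumed)
safety_days = 1      # extra buffer against disruptions (assumed)

avg = statistics.mean(daily_usage)
reorder_point = avg * (lead_time_days + safety_days)

on_hand = 150
if on_hand <= reorder_point:
    print(f"reorder: on hand {on_hand} <= reorder point {reorder_point:.0f}")
```

The lead time this buys is exactly the point in the slide: the alert fires days before the bin is empty, while alternative arrangements are still possible.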
Assembly Line Quality Assurance

Business Problem
o High-tech manufacturing uses sensors to capture data at critical steps in the manufacturing process
o Sensor data helps diagnose errors with returned products
o Much of the data is discarded because of high storage costs
o Lean margins mean small budgets for data analysis

Solution
o Big Data architecture stores unstructured, streaming, dirty sensor data
o Manufacturers can proactively analyze more data, over a longer time, to detect subtle issues that would otherwise go undetected
o Sensor data managed with a Big Data architecture can help a manufacturer reduce warranty costs and earn a reputation for quality
Proactive Maintenance

Business Problem
o Today's manufacturing workflows involve sophisticated machines coordinated across pre-defined, precise steps
o One machine malfunction can stop the production line
o Premature maintenance has a cost; there is an optimal schedule for maintenance: not too early, not too late

Solution
o Big Data architecture stores unstructured, streaming machine data
o Manufacturers can derive optimal maintenance schedules based on real-time information and historical data
o Maximize equipment utilization, minimize P&E expense, and avoid surprise work stoppages
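The "not too early, not too late" trade-off above can be made concrete: choose the service interval that minimizes expected cost per operating hour, balancing the cost of planned maintenance against the rising risk of an unplanned stoppage. A toy model, where the costs and the wear-out curve are invented for illustration (in practice the curve would be fitted from the stored historical machine data):

```python
# Toy optimal-interval calculation: expected cost per operating hour
# as a function of the maintenance interval.
MAINT_COST = 1_000      # planned service (illustrative)
FAILURE_COST = 20_000   # unplanned line stoppage (illustrative)

def failure_prob(hours):
    # Made-up wear-out curve: risk grows with hours since last service.
    return min(1.0, (hours / 500) ** 2)

def cost_per_hour(interval):
    # Expected cost of one cycle (one service plus the chance of a
    # failure within it), spread over the hours the cycle covers.
    expected = MAINT_COST + failure_prob(interval) * FAILURE_COST
    return expected / interval

best = min(range(50, 501, 50), key=cost_per_hour)
print(best, round(cost_per_hour(best), 2))
```

Servicing every 50 hours wastes money on premature maintenance, while waiting 200+ hours lets failure risk dominate; the minimum sits in between, which is the optimal schedule the slide describes.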
Crowdsourced Quality Assurance

Business Problem
o Thoroughly tested products still have post-sale problems
o Customers may not report problems to the manufacturer, but still complain about the product using social media
o This social stream of data on product issues can augment product feedback from typical support channels

Solution
o Big Data architecture stores huge volumes of social media sentiment data
o Manufacturers can mine this data for early signals on how a product holds up after delivery to the customer
o Learn about issues quickly and take early action to protect the product's reputation and win customer loyalty
Other Use Cases
The Big Data Landscape v1
The Big Data Landscape v2
Summary

Big Data only describes a problem or situation. To ensure that you get the most value out of your Big Data projects, you need to:

o Craft a goal-oriented plan with quick wins
o Fail fast
o Use the data you already have
o Don't limit yourself to only the data that you have
o Partner effectively across organizations
o Commit to operating the business differently
o Ask why and what else
o Choose the right tool in your toolbox (RDBMS, Hadoop, etc. all have their place)
Q&A

E-mail: Bret.Farrar@SenderoCorp.com