Making big data come alive
NAVIGATING THE BIG DATA JOURNEY
Big Data and Hadoop: Moving from Strategy to Production
London | Dublin | Mumbai | Boston | New York | Atlanta | Chicago | Salt Lake City | Silicon Valley
(650) 949-2350 | thinkbiganalytics.com
While Hadoop is a maturing technology, many companies are still experimenting with big data initiatives, and others don't know exactly where or how to begin. As technology continuously changes, charting a successful course for big data investments is no easy task, even for organizations that have moved beyond the pilot phase.

How far along are you in your big data journey? Wherever you are, any number of challenges can stand in the way of successfully harnessing the power of big data and maximizing its return, such as:

- Moving past big data strategy and into production
- Avoiding common stumbling points in the development cycle
- Determining which big data skills are needed now and in the future
- Aligning business and IT groups around common big data use cases
- Focusing on one particular use case to the detriment of others

Where to begin?

The starting point for any big data journey is a big data strategy paired with a business vision, with appropriate sponsorship from both IT and the business. A best practice is to put the strategy in place first, then proceed to application development, followed by solution deployment.

As Hadoop and big data become more pervasive within the enterprise, organizations are developing their own ideas of how to move forward. Many companies have researched how they will use big data and have even identified low-hanging fruit where one use case can deliver quick value. The challenge with this approach is that they've developed a solution for a single use case, not for multiple initiatives with a long-term roadmap. This narrow focus can keep big data projects from realizing their full potential.
A strategy for success

A big data strategy should provide three core components:

- An analysis of how Hadoop can be used to drive business value
- A roadmap built on a collaborative vision from business and technology stakeholders
- A comprehensive architecture definition supporting differentiated use cases

A roadmap helps identify how both business and IT teams can drive value from Hadoop, as well as determining architecture investments, deciding which datasets will be landed first, and identifying pilot stakeholders who will carry over to implementation. A roadmap should serve as a guide for moving the big data initiative forward over the next 12 months.

An architecture definition is not a detailed design, but rather a high-level architecture that identifies the core functions of a big data solution supporting the multiple use cases on the roadmap: use cases driven by the joint vision of business and technology teams. When identifying use cases, dozens might surface. The goal is to pick the most valuable use cases in which to invest over the next 12 months. Also keep in mind that your big data architecture and design will have to support all of the functions and features that will be needed, such as real-time and long-term storage, data access management, datasets exported to Hadoop for end users, various tools and distributions, and a metadata strategy. The system should be built for long-term value, not just a few use cases. When a big data solution is built with the top use cases in mind, the next set of use cases often follows naturally down the road.
From strategy to development

With the strategy created and guiding the way, the next phase is development. Targeting an initial use case that can be quickly implemented is a proven way to accelerate business value. One best practice is to start an initiative that gets end users involved early while also supporting long-term investment. Aiming too high and setting lofty goals can delay results; it's not acceptable to ask business users to wait 10 months or more for an application to be deployed.

Defining an operations support plan is also an important step, because your application and operations teams will need to work closely together to build skills internally. Cluster setup, configuration, and ongoing support are critical to application development. In big data projects, the systems administrator is often tapped to be the Hadoop administrator and faces a steep learning curve, which will also require ongoing support.

For many organizations, scaling activities around Hadoop and developing production data flows can be an area of significant challenge. That's why checks and balances on data quality are essential: plan ahead for ways to ensure the data stays accurate and reliable. It's important to consider what your end user will see and where data will be stored, and to examine application needs for how frequently business users require data to be pushed.

From development to production

The best practices outlined below are not new or revolutionary, but they are critical to achieving success when moving from big data development to production. There can be an explosion of data when the solution is deployed, which is why capacity planning is a must.
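The data-quality "checks and balances" described above can be sketched as a simple batch validation run at ingest time, before data is published to end users. This is a minimal illustration only; the record fields, thresholds, and function names are hypothetical, not from this paper.

```python
# Illustrative sketch of ingest-time "checks and balances" on data quality.
# All field names and thresholds here are hypothetical examples.

def check_batch(records, expected_min_rows, required_fields, max_null_ratio=0.05):
    """Validate one ingested batch; return a list of quality problems (empty = pass)."""
    problems = []

    # Completeness: did we receive roughly the volume we expected?
    if len(records) < expected_min_rows:
        problems.append(f"row count {len(records)} below expected {expected_min_rows}")

    # Field-level quality: how often is each required field missing?
    for field in required_fields:
        nulls = sum(1 for r in records if r.get(field) in (None, ""))
        ratio = nulls / len(records) if records else 1.0
        if ratio > max_null_ratio:
            problems.append(
                f"field '{field}' null ratio {ratio:.1%} exceeds {max_null_ratio:.0%}"
            )

    return problems

batch = [
    {"device_id": "d1", "temp": 41},
    {"device_id": "d2", "temp": None},
    {"device_id": "d3", "temp": 39},
]
print(check_batch(batch, expected_min_rows=3, required_fields=["device_id", "temp"]))
# One in three records is missing "temp", so the batch is flagged.
```

In practice the same idea applies whatever the tooling: agree volume and completeness thresholds with the business up front, check each batch against them before it is pushed to users, and route failures to the joint application/operations team for triage.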
Start by creating test runs in the pre-production environment to estimate the footprint (disk space and memory) and how much data will be processed (ingesting, storing, processing in-memory). Using a few months of data can be a good start and may provide an idea of how much capacity a few years of data would consume.

Performance testing is another must. Performance on Hadoop can be estimated by running pre-packaged jobs, such as TeraSort. Doing so gives the big data team a simple baseline for the expected throughput of the cluster. Tuning the cluster based on observations from these job runs is much easier than tuning it with the target applications. Cluster tuning specific to the hosted applications is still important, but it should come after baseline cluster configuration and optimization.

Success in moving from big data development to production goes back to a strong partnership between application and operations teams, because daily workshops may be needed at the start of data ingestion to analyze log data and troubleshoot issues.

Big Data Strategy and Roadmap: Think Big Engagement Model

Creating a big data strategy and roadmap with Think Big involves four phases.

Discovery
- Collaborative business and technology workshops: discuss data challenges, data opportunities, the value of being able to ask new questions of new datasets, and so on.
- Identification and prioritization of high-value use cases: typically identify 50 to 60 use cases and determine which ones could deliver the most use and value.

Architecture Definition
- Architecture recommendations.
- Finalize the use case list and perform criteria scoring.

Readiness Analysis
- Development of capability definitions, including organization and training: identify skill gaps in order to develop training to properly use the new technology and tools, and to maintain solutions after implementation.
- Analysis of use cases against current and future technology, organizational strategy, and data: identify access patterns by use case, which will then drive tools and applications on the cluster.

Roadmap & Recommendations
- Creation of a 12-month roadmap based on priorities: piecing together all components in a sequence plan with business and technology stakeholders, layering use cases and implementation timelines against the required investments.
- Identification of the different datasets that will land in the Hadoop cluster, along with the business milestones and how they align with the overall big data plan.
- Executive presentation: securing the investment needed to go forward with your big data initiatives.

Making big data come alive

Think Big, a Teradata company, provides data science and engineering services that enable organizations to accelerate their time to value from big data. As the first big data services firm, Think Big's data scientists, data engineers, and project managers are trusted advisors to the world's most innovative companies. Visit thinkbig.teradata.com.

Moving past big data dreaming and pilots and into full production requires firmly grasping the right analytic priorities, architecture, infrastructure, skills, and support. For more details, check out this big data and Hadoop webinar with Mike Portell, director of client services at Think Big.

Avoiding Big Data Pitfalls

Strategy
- Lack of business sponsorship for implementation: a big data strategy could be delayed or even dropped if support from the business has not been identified.
- Investing too early in a Hadoop cluster: experimenting is fine, but too many hands in a cluster will lead to misconfigurations and stall the move to implementation.
It's also important to consider what the environment will be used for in the future and whether any remediation will be needed to move forward with application development.

Development
- The Hadoop ecosystem evolves quickly: major new capabilities are introduced frequently, and it can be important to take advantage of them. Be aware of how changes in the environment will affect existing applications and whether they could lead to significant code remediation.
- Change management for analysts: it's important to get end users (data scientists and analysts) involved early by pushing new work to them quickly and asking for feedback, so that when deployment happens there will be fewer issues.

Production
- Data multiplication: duplicate data and metadata can quickly increase data size. In the beginning, storing as much data as possible will deliver the most value, but data multiplication will happen once that data is ingested into a cluster. It's important to know how data will multiply once ingested, and to create checks and balances on the data so volumes remain manageable.
- Historical data load and performance: significant processing on ingest can lead to an unexpectedly long historical data load timeline. This stage offers an opportunity to configure and tune the cluster and application to bring that timeline back to something reasonable.
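As a back-of-the-envelope illustration of how data multiplication drives capacity planning, the sketch below projects a cluster disk footprint from a measured monthly raw volume. The multipliers used (an HDFS replication factor of 3, a 2x allowance for derived datasets and scratch space, a 2:1 compression ratio) are common rules of thumb, not figures from this paper; substitute values measured from your own pre-production test runs.

```python
# Back-of-the-envelope estimate of the "data multiplication" effect on capacity.
# All multipliers are illustrative defaults -- replace with measured values.

def projected_footprint_tb(raw_tb_per_month, months,
                           replication_factor=3,
                           derived_and_temp_multiplier=2.0,
                           compression_ratio=0.5):
    """Estimate cluster disk footprint in TB for `months` of retained data."""
    raw = raw_tb_per_month * months                          # raw data landed over the period
    compressed = raw * compression_ratio                     # after file-format compression
    with_derived = compressed * derived_and_temp_multiplier  # duplicates, aggregates, scratch
    return with_derived * replication_factor                 # each HDFS block is stored N times

# E.g., measure 2 TB/month in a pre-production test run, then project 3 years:
print(projected_footprint_tb(2, 36))  # 72 TB raw grows to 216 TB on disk
```

Even this crude arithmetic makes the pitfall concrete: a seemingly modest monthly volume triples on disk from replication alone before derived copies are counted, which is why the footprint should be estimated from real test runs rather than from raw source size.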
High Tech Manufacturer Implements Big Data Strategy to Improve Yields

A large manufacturer of external storage products had a gold mine of historical log data from its drives, but needed a way to unlock the tremendous value hidden in this "drive DNA." Historical log data would drive new analytics from information that wasn't previously used. The storage-technology manufacturer turned to Think Big to embark on a program to build a big data platform that would reduce the amount of time engineers spend searching for data, facilitate large-scale analytics for yield improvement, and help customers identify problems before they happen.

Think Big has worked with the manufacturer on big data strategy, architecture design, data lake implementation, and analytic solution development, helping deliver rapid value while also stabilizing the environment for production. Over the course of two years, Think Big has helped the manufacturer move from strategy to development to production for two major big data applications. Through the platform, the manufacturer has uncovered opportunities to reduce scrap waste, speed time to market, and gain more timely analytics and insights. This end-to-end data analysis delivers significant bottom-line benefit through reduced development time, improved manufacturing yield, and increased customer satisfaction.

Think big. Start smart. Scale fast.

Contact us to learn more about how we can help you make big data come alive to deliver true business value.

EB-7095 0615