and Brief Journey into the New World of Next Generation Data Infrastructures
IT Transformation Advisory Proposal

Dr. Gopal Dommety, Gopinath Rebala, Dr. Vineet Dhyani, Paddy Ramanathan (N42)
Dr. Riad Hartani (Xona Partners)

December 2013

1 Preamble

The last decade has seen the emergence of a large number of leading information-centric technology players that share one thing in common: without a reliable, large-scale data management infrastructure, their business would rapidly converge to zero. Leading players such as Google, Facebook, Amazon, Yahoo, Baidu, Alibaba and a select few know this all too well, and have accordingly made their infrastructure management a top priority. The next decade looks much the same, with one caveat: the top priority is now becoming the first priority. In fact, this same data and information technology (IT) infrastructure, which has evolved more or less linearly over the years, is soon to see significant architectural disruptions, with the goal of making it orders of magnitude more scalable, reliable, intelligent and manageable. This is exactly where N42, Inc. ("N42"), in partnership with Xona Partners, has set its sights, putting together a transformational data and information technology practice to address these upcoming challenges. In retrospect, looking back at the seminal "IT Doesn't Matter" paper (Harvard Business Review, circa 2003), we believe, and are ready to say, that from 2014 onwards IT matters more than ever.

Initiating our analysis from N42's Menlo Park headquarters in California's Silicon Valley and Xona Partners' San Francisco and Singapore headquarters, and leveraging the hands-on know-how of our teams in Hong Kong, Bangalore and London, we set sail on a journey through what we see as the data infrastructure evolution challenges, and highlight our evolving contributions aimed at overcoming them. This short positioning paper is presented as a baseline for follow-up detailed discussions on the various topics under consideration, with the goal of designing, customizing and optimizing our solutions to the needs of leading data-centric organizations, leveraging the broad and complementary expertise of our team. It builds specifically on N42's successful cloud and data transformation rollouts in the recent past, as well as Xona Partners' extensive IT transformation and cloud migration audits, strategies and implementations. The fundamental premise of the question we are addressing is fairly simple: as a business built around gathering large-scale data sets, mining and learning from that data, and optimizing communication between those producing it and those using it, where shall I go from here? The answer touches on the evolution of the compute platform on which the data sets sit, but also on aspects as varied as business intelligence, data warehousing, reliability and availability, performance and scalability, monitoring, metrics and diagnostics, real-time analytics, the Data API and governance, and additional requirements such as security.

1.1 Rationale for a data infrastructure evolution revisit

Most information technology players are, or will soon be, considering enhancing their data infrastructure to help build data solutions that optimize their ability to identify, capture and manage data, providing actionable, trusted insights that improve strategic and operational decision making and result in incremental revenue and a better customer experience. The current challenges of existing platforms mostly affect the operations teams' ability to provide reliable SLAs for business-critical reporting jobs, and the data platforms' ability to scale and perform effectively as the business grows across distributed geographies.

As such, the desired goal is to create a solid foundation architecture that provides these functional capabilities optimally, along with a platform on which to overlay additional applications such as business intelligence and a Data Science as a Service capability. For most players, an insider-only approach leveraging internal resources will only go so far in architecting and designing this new-generation architecture, given the breadth of expertise required and, more importantly, the need for a step-back, outside-the-box way of designing things. As such, our teams aim to be the strategic advisory partner that brings in such expertise, building open models tested and validated across a large set of data platform development models.

2 Data Infrastructure: The Road Ahead

Our approach to these IT transformation goals is based on our understanding of existing big data architectures and deployment models, on the assumptions we make about how data and information models are structured, and on the underlying performance and reliability requirements. Additionally, our approach aims at providing a capability maturity path that increases the capabilities of the system with minimal disruption to current operations, forming the basis for an evolutionary migration. As such, the architectural models we build our platforms on are designed to address some of the persistent problems in current infrastructure as well as the future data processing needs of the organization.

2.1 Current Data Infrastructure: a revisit

As of today, existing information infrastructure and analytics processes suffer from challenges we have observed and worked on in the most typical large-scale data and IT projects, some of which are listed below:

- Data quality issues: data consistency and completeness problems due to intake problems
- Stringent SLAs are not met for business-critical jobs
- Implementation and orchestration of jobs in Java/Perl is fairly unproductive
- Incomplete metering, monitoring and diagnostics of distributed software jobs and processes
- Increasing completion times for software jobs as ad-hoc usage increases
- Data architecture issues regarding naming, change management and running analytics
- Lack of comprehensive policy-based management for retention, replication, archival and compression
- Lack of incremental analysis with real-time data: reports and events are generated only from near-real-time data
- Lack of comprehensive reports with near-real-time granularity, as opposed to hours or days
- Need to consolidate multiple data centers into a combined cluster for data analysis

We believe that monitoring for usage patterns and setting policies for meeting acceptable SLAs, along with capacity planning, will give the system predictability and performance. A thorough process-to-resources mapping analysis, for a better understanding of system usage, will lead to novel architectures that allow costing and time estimates for meeting the required SLAs.

In the upcoming sections, we position our thesis on how these data architectures should evolve over the short, mid and long term. The analysis is presented in a generic fashion, but builds on selective and specific case studies that we have worked on, developed and completed in the real world, leveraging the methodologies, processes and tools we have designed. In other words, what is described is a pragmatic, successfully completed case study, albeit made generic to show broader applicability. We believe that leading IT data architects will relate to the assumptions, the models and the approach, and will see in them a template from which we can work, hand in hand, to evolve their IT and data architectures.

2.2 Data Infrastructure Optimization: the low-hanging fruit

Our approach takes the capability maturity path into account by introducing components into the system to address the pressing needs of reliability and predictability. Additionally, we introduce components to meet the SLAs of what we anticipate to be upcoming system-level jobs and to support future performance requirements. A representative architecture is presented here that supports predictable process completion and faster analytical query processing.

Components added to the system are described with their functionality in this section. The suggestions here assume that there is no major upgrade to the Hadoop infrastructure; component versions will be those compatible with the existing infrastructure.

- N42 Monitor
- Flume
- Oozie
- Hive on Spark
- Data API

This architecture, when translated into an operational mode, needs to be fine-tuned based on analysis of the actual existing infrastructure and operations. The analysis phase will include an evaluation of current infrastructure usage, failure characteristics and data architecture, leveraging in-house custom-designed tools and processes. The results will be evaluated by a team comprising data warehouse architects, system architects and domain-specific functional experts, resulting in capacity recommendations, data architecture recommendations and a transition/scaling plan that minimizes risk to critical business operations. These various aspects are briefly highlighted below.

Business Intelligence

The architecture components are designed to provide off-the-shelf analytical components that fit in with minimal integration work. There are a few data visualization and reporting applications providing dashboards, charting and spreadsheets; some of them are integrated with Hive and/or can work with JDBC or REST/JSON formats. Pentaho, Talend, Tableau, Datameer and Platfora are some of the available BI tool options. Platfora is a good choice, providing a good mix of integration with the Hadoop ecosystem and an easy-to-use frontend for data analysis. The Data API layer on top of the data warehousing infrastructure is a software layer, developed according to the requirements, that provides security and decouples third-party access from the data architecture.

Data Warehousing

Hive on Spark allows in-memory queries for analytics, providing near-real-time analysis of data as opposed to running Hive using MR batch processing on HDFS. This component addresses the SLA requirements of the reporting solution without having to re-implement the existing reports, while adding performance. Additionally, it provides better data import/export than the MongoDB NoSQL solution, with better performance at lower cost. Hive provides a better ability to run ad-hoc queries and supports SQL-style syntax for running analysis without having to write MR jobs in Java or Perl. It also provides horizontal scalability on the existing cluster without additional data transfers in and out of the cluster.
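To make the ad-hoc query point concrete, the minimal sketch below runs a HiveQL aggregation through the standard HiveServer2 JDBC driver instead of a hand-written MR job. The host, credentials, table and column names are illustrative placeholders, not taken from any specific deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: an ad-hoc analytical query submitted through HiveServer2's JDBC
// interface rather than a Java or Perl MapReduce job. Connection details are placeholders.
public class AdHocHiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {
            // SQL-style syntax; the execution engine behind it is whatever Hive is configured to use.
            ResultSet rs = stmt.executeQuery(
                "SELECT event_date, COUNT(*) AS events "
              + "FROM tracker_logs WHERE event_date >= '2013-12-01' "
              + "GROUP BY event_date ORDER BY event_date");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

The same query can be pointed at HiveServer2 regardless of how the execution engine underneath evolves, which is what makes this a low-disruption path for existing reporting jobs.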

Reliability and Availability

The custom-designed N42 SLA and policy-monitoring tool provides visibility into job/workflow performance and guidance for performance improvement, and, integrated with Ganglia, provides a comprehensive dashboard. This product allows job/workflow-level monitoring of cluster usage and brings predictability to job completion times, improving the reliability of the system.

Flume provides a reliable way of aggregating log files from the various web servers as well as streaming in tracker and audience data. Flume provides spooling directories, a JMS source and a plugin capability that can be used to customize Flume data ingest to detect missing expected data. Custom plugins can also be used to perform some client-side cleaning before ingestion into Hadoop, with a possible reduction in the amount of data ingested. Oozie provides an event-based workflow mechanism for launching jobs on data ingest into HDFS or HCatalog. Additionally, Oozie provides an easy way of specifying job workflows, including Pig and Hive jobs, and allows SLA specification for a workflow. These components will provide critical data reliability for running reports, address data quality issues, and handle the late-data problem effectively, even if dependencies on external systems are not completely eliminated.

Performance and Scalability

Performance and scalability improvements are achieved using Hive on Spark as well as the N42 custom-designed platform for monitoring and analysis. Linear scalability and performance at scale can be achieved by using the Hadoop 2.0 architectures. Hive/Spark, along with proper capacity allocation processes, will resolve issues with reliable reports and the ability to rerun reports, and will provide better ad-hoc query performance.

Monitoring, Metrics and Diagnostics

N42 tools provide a dashboard for comprehensive monitoring of the cluster, using the Ganglia monitoring system or an existing monitoring system, along with job profiling and analysis. This in turn brings predictability to job completion times based on job profiles, which provides excellent diagnostic capability for job performance. Cluster problems such as out-of-memory conditions, disk issues or job failures can be detected effectively and addressed proactively.

Data API

The Data API is a virtualization layer that hides the underlying platform details and provides REST or JDBC interfaces for external interaction. There are no on-premise Data API solution providers; some providers, such as Qubole, offer Data API services under a SaaS model. Some solutions offer simplified data import/export to NoSQL databases. These solutions can be integrated to provide a consistent data view to external actors.
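As an illustration of the Data API idea, the minimal sketch below exposes a single REST endpoint that hides the warehouse layout behind a stable JSON contract, using only the JDK's built-in HTTP server. The endpoint name and payload are hypothetical; a real implementation would delegate to the warehouse (for example through the Hive JDBC interface shown earlier) and add authentication and authorization.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

// Minimal sketch of a REST-style Data API facade. External consumers see a stable
// JSON contract; the underlying tables and engines stay hidden behind this layer.
// The /reports/daily endpoint and its canned payload are illustrative only.
public class DataApiServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/reports/daily", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                // In a real deployment this would query the warehouse; here it returns canned JSON.
                byte[] body = "{\"date\":\"2013-12-01\",\"events\":123456}"
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
        });
        server.start();
        System.out.println("Data API listening on http://localhost:8080/reports/daily");
    }
}
```

Because clients code against this interface rather than the storage layer, the warehouse can later move from MR to Spark or from Hive to HBase without breaking external integrations.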

2.3 Proposed Long Term Data Infrastructure

A longer-term proposal is to migrate to YARN-based Hadoop MR processing, along with improved real-time analytics for event processing using Storm, as illustrated below. This proposed architecture adds the following features to the existing system:

- Improved data ingest process for achieving high confidence and consistency in data analysis, even in the face of variable data loads (see the ingest plugin sketch after this list)
- Policy and SLA specification for workflows
- Visibility into job execution and cluster state, and the capability to debug jobs
- Faster ad-hoc query processing for analysis
- Data API for faster integration with BI tools
- Real-time data analysis and reports
- High availability with fine-grained data governance
- Improved data lifecycle management
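The improved ingest process relies on Flume with custom plugins for client-side cleaning and missing-data detection, as described in sections 2.2 and 2.3. The sketch below shows one such plugin as a standard Flume interceptor; the record layout and field-count rule are hypothetical, purely to illustrate the extension point.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Minimal sketch of a custom Flume interceptor for client-side cleaning before
// ingest into Hadoop: it drops empty or truncated tracker records so they never
// reach HDFS. The EXPECTED_FIELDS rule is a placeholder, not from the proposal.
public class TrackerRecordCleaner implements Interceptor {

    private static final int EXPECTED_FIELDS = 12;  // hypothetical record layout

    @Override
    public void initialize() { /* no state to set up */ }

    @Override
    public Event intercept(Event event) {
        String line = new String(event.getBody());
        // Returning null tells Flume to discard the event.
        if (line.trim().isEmpty() || line.split("\t", -1).length < EXPECTED_FIELDS) {
            return null;
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<Event>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() { /* nothing to release */ }

    // Flume instantiates interceptors through a Builder named in the agent configuration.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new TrackerRecordCleaner(); }

        @Override
        public void configure(Context context) { /* no custom properties in this sketch */ }
    }
}
```

A counter of dropped events in such a plugin is also a convenient hook for the missing-data detection mentioned above.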

Components added to the system are described with their functionality in this section. We recommend using the latest versions of the distributions from Cloudera or Hortonworks for integrated components, better maintenance and performance improvements. One illustrative deployment is described below.

- N42 Think
- Hadoop
- Flume
- Kafka
- Sqoop
- Oozie
- Pig
- Spark
- Hive
- HBase
- Storm
- Hue
- Zookeeper
- HCatalog

Business Intelligence

The architecture components are designed to provide off-the-shelf analytical components that fit in with minimal integration work, as described in the short-term architecture. Platfora is a good choice, providing a good mix of integration with the Hadoop ecosystem and an easy-to-use frontend for data analysis, as mentioned in the earlier section. Platfora also supports heat maps, charts and drilldowns for publishers.

Data Warehousing

HBase is very efficient at fast time-range scanning, time-range queries, data drilldown and similar operations on read-mostly data with low-throughput writes. Additionally, HBase supports quick snapshots and is an ideal data warehousing platform. Data cubes stored in HBase allow cube operations such as pivoting and drilldown via HBase. HBase is a good data warehousing option in terms of cost/performance for report generation. Hive on Spark allows in-memory queries for analytics, providing near-real-time analysis of data as opposed to running Hive using MR batch processing on HDFS. This component addresses the SLA requirements of the reporting solution without having to re-implement the existing reports, while adding performance. Additionally, it provides better data import/export than the MongoDB NoSQL solution, with better performance at lower cost.
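A minimal sketch of the time-range scanning pattern mentioned above, using the classic HBase Java client; the table, column family and qualifier names are placeholders rather than part of the proposed schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch: scan the last hour of cells in a report cube table, letting
// HBase filter on cell timestamps server-side. Names are illustrative only.
public class HourlyDrilldown {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "report_cube");
        try {
            long end = System.currentTimeMillis();
            long start = end - 60L * 60L * 1000L;   // last hour of cell versions
            Scan scan = new Scan();
            scan.setTimeRange(start, end);          // server-side timestamp filtering
            scan.setCaching(500);                   // batch rows per RPC for scan throughput
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("m"), Bytes.toBytes("views"));
                    if (value != null) {
                        System.out.println(Bytes.toString(row.getRow()) + "\t" + Bytes.toLong(value));
                    }
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```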

Hive on HBase supports the ability to rerun batch queries, faster performance for ad-hoc queries, 30-minute resolution reports, and a higher-dimensional data store for better filtering and analysis of data with an SQL query language.

Reliability and Availability

Hadoop 2.0 supports a distributed JobTracker and high availability for HDFS, avoiding a single point of failure in the Hadoop deployment. HDFS replication itself lends high availability to data on the file system. ZooKeeper should be deployed with three nodes for high availability of the cluster. Data policies for archiving and snapshots through HBase will provide reliability and disaster recovery options for the cluster. The N42-designed SLA and policy-monitoring tools provide visibility into job/workflow performance and guidance for performance improvement, and, integrated with Ganglia, provide a comprehensive dashboard.

Data ingest is one of the critical first steps in achieving data consistency for analysis. Data is received from a large number of sources for the tracking and audience back ends. A process needs to be in place to detect missing data from expected sources and to ingest data into HDFS efficiently, with error detection and optimization. Flume allows predictable and efficient data ingestion into the HDFS file system, providing visibility into failures and improving the performance of data ingestion. Missing data can be detected with custom plugins in the Flume pipeline. Depending on the requirements, it is possible to use Kafka in the pipeline for reliable delivery of data and prevention of data loss. Kafka addresses the following scenarios:

- Loss prevention for spikes in logger and audience data that exceed the capacity for HDFS ingest
- MR job failures at the queue sink, or any other failures in the pipeline

Oozie provides an event-based workflow mechanism for launching jobs on data ingest into HDFS or HCatalog. Additionally, Oozie provides an easy way of specifying job workflows, including Pig and Hive jobs, and allows SLA specification for a workflow. This implementation will allow better quality of data for reliable reports and better performance on scheduled reports as well as ad-hoc queries.

Performance and Scalability

Cloudera deployments are installed with the Fair Scheduler as the default option. For achieving high throughput for the jobs, we recommend the Capacity Scheduler as the scheduler of choice. Performance and scalability improvements are achieved using Hive on Spark as well as the N42 platform for monitoring and analysis. Linear scalability and performance at scale can be achieved by using the Hadoop 2.0 architecture, as defined in the next section.
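To make the Oozie-based orchestration above concrete, the sketch below submits a workflow (for example, a Pig cleanup step followed by a Hive report) through Oozie's Java client and polls its status. The Oozie URL, HDFS paths and property values are placeholders, not taken from any specific deployment.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Minimal sketch: launch and track an Oozie workflow from Java. The workflow.xml
// referenced by APP_PATH is assumed to chain the Pig and Hive actions for a daily report.
public class LaunchDailyReportWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Path to the workflow definition in HDFS (placeholder).
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode.example.com:8020/user/etl/workflows/daily-report");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");
        conf.setProperty("queueName", "reporting");

        String jobId = oozie.run(conf);      // submit and start the workflow
        Thread.sleep(10 * 1000L);            // give it a moment before polling
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}
```

In practice the same workflow would usually be triggered by an Oozie coordinator watching for the day's data landing in HDFS or HCatalog, which is the event-based mechanism described above.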

This cluster is designed to be a single cluster supporting the data needs of the business, able to consolidate all or some of its data centers. Processes and policies for data lifecycle management, covering archiving, retention, compression and replication, will allow efficient data management with low overhead costs.

Monitoring, Metrics and Diagnostics

The N42 platform provides a dashboard for comprehensive monitoring of the cluster, using the Ganglia monitoring system or an existing monitoring system, along with job profiling and analysis. It brings predictability to job completion times based on job profiles, which provides excellent diagnostic capability for job performance. Fine-grained estimations of cluster usage by user, job type and time of day will allow better policies for cluster usage and planning. The diagnostic insights lead to higher-performing jobs, better data design and fewer failures in the cluster.

Data API

The Data API is a virtualization layer that hides the underlying platform details and provides REST or JDBC interfaces for external interaction. There are no on-premise Data API solution providers; some providers, such as Qubole, offer Data API services under a SaaS model. Some solutions offer simplified data import/export to NoSQL databases. These solutions can be integrated to provide a consistent data view to external actors. Developing applications against data interfaces that are decoupled from the data storage structure will lead to lower maintenance costs and better integration with partners.

Real-time analytics

Real-time data processing has to accommodate high-velocity data streams and process data in near real time for alerts and analysis. A real-time processing system using Storm and Kafka will support the following functionality:

- Horizontally scalable ingest
- Ability to process large-scale event volumes (millions/sec)
- Reliable processing (no data loss)
- Archival of processed events for further temporal analysis
- Plugin support for event processing

Storm supports high-throughput event processing and achieves reliability by using Kafka for incoming data. Processed events can generate further events that are acted upon in real time by additional jobs. The processed data is persisted in HBase for efficient storage and can be combined with historical data in the cluster for generating reports at short 30-minute intervals. This real-time processing infrastructure will support mobile reports that are expected to be generated in near real time, which in itself is a significant source of intrinsic business value.
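A minimal sketch of the Storm piece of this pipeline follows. For brevity the spout generates synthetic events; in the proposed architecture it would be replaced by a Kafka-fed spout, and the counting bolt would persist rolling aggregates to HBase rather than holding them in memory. Names and parallelism settings are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal sketch of a Storm topology for the real-time analytics layer described above.
public class RealtimeEventTopology {

    // Stand-in for the Kafka-fed stream of tracker/audience events.
    public static class RandomEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("event-type-" + random.nextInt(5)));
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // Keeps an in-memory running count per event type; a real bolt would flush to HBase.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String event = input.getStringByField("event");
            Long current = counts.get(event);
            counts.put(event, current == null ? 1L : current + 1L);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt in this sketch: nothing is emitted downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new RandomEventSpout(), 1);
        builder.setBolt("counts", new CountBolt(), 2).fieldsGrouping("events", new Fields("event"));

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();   // StormSubmitter would be used on a real cluster
        cluster.submitTopology("realtime-events", conf, builder.createTopology());
        Thread.sleep(30 * 1000L);                    // let the local topology run briefly
        cluster.shutdown();
    }
}
```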

3 Advisory and Professional Services Offering

Our partnership (as Xona Partners and the N42 team) with your IT strategy and delivery teams would be based on the following advisory and professional services:

- Big Data Strategy Services: deciding on the right big data strategy, transitioning from PoC to production, and integrating big data with BI solutions
- Big Data Design and Deployment Services: designing/dimensioning Hadoop/Cassandra clusters, troubleshooting/optimization of Hadoop and Cassandra, data center design (physical design) plus network design for big data (Hadoop and Cassandra), and consulting on Pig, Hive, Sqoop, Flume, HBase, Spark, Storm, etc.
- Data Science Services: custom Map/Reduce development, machine learning/data science on Hadoop (using Mahout or RHadoop), Hive/HBase development, and building analytics applications on Hadoop/HBase/Cassandra/MongoDB
- Big Data Operations Services: reliability and uptime assessment; security and governance (policy, compliance, etc.); monitoring, alerting and remediation (including run books)

Post initial analysis, the implementation teams would work jointly on executing the following tasks:

- Job Analytics: provides the ability to analyze jobs, their current SLAs, failures, reasons for failures and SLA violations. For instance, job failures due to infrastructure choke points and data skews are identified.
- Job performance profiling: provides a detailed view of the internal structure of a job's execution, covering each task in the job, the different phases of the job (e.g., shuffle) and the performance of tasks within the job.
- Cluster Analytics: provides aggregate cluster usage as well as the per-job slice of cluster usage.
- Job Profiling: provides an analysis of all previous runs of a job along with diagnostic information.
- Workflow Analytics: detailed performance for the entire workflow as well as for individual jobs and tasks of the workflow.
- Visibility into different types of jobs (ETL, native MR, Pig, HBase, etc.):
  - Pig and/or Scalding jobs for file ingest
  - Sqoop jobs for relational DB ingest
  - Camus jobs for Kafka message queue ingest
  - Storm bolts for ingest from Storm clusters
- SLA and Policy Analytics: analysis based on user-defined SLAs and policies.

4 The N42 Multi-disciplinary Lead Team

A few words about the N42 team, to better explain our perspective and our thesis on how to approach these fundamental technical and business challenges. N42 has developed deep expertise and know-how in big data and has experienced, first hand, the various challenges associated with building and operating large-scale big data infrastructures. Our solutions help customers monitor and analyze their big data infrastructure and ensure, through SLA and policy definitions, that it operates to meet the business needs, something that is, and will become even more, central to any data-centric business.

Team N42 has extensive experience in creating infrastructures and applications that can receive live upgrades. We understand the operational complexity of major data transformations and application upgrades. We perform incremental upgrades with minimal or no downtime of services, using a prep, update, switch and stabilize approach. N42 has experience doing this in SaaS-based healthcare, helpdesk, online recruitment, retail and distributed manufacturing applications.

The analysis phase will include an evaluation of current infrastructure usage, failure characteristics and data architecture with the help of N42 tools. The results will be evaluated by a team comprising a data warehouse architect, a system architect and an ad-tech functional expert, resulting in capacity recommendations, data architecture recommendations and a transition/scaling plan that minimizes risk to critical business operations. The team has demonstrated the following in previous projects:

- Deep and broad experience in the data management domain, in both front-end and back-end infrastructure solutions
- Strong expertise in data science, statistical analysis and machine learning in the context of any industry vertical
- Extensive experience in architecting and implementing big data for data-heavy businesses, including one of the largest social media networks
- Detailed experience with the implementation and management of large big data installations, as well as the underlying network and IT infrastructure
- Strong project management and delivery expertise in delivering complex projects with geographically dispersed teams

More specifically, N42 has expertise and offers services in designing and deploying large-scale, complex projects in the areas of:

- Big Data and Data Warehousing
- Data Science and the Ad-Tech domain
- Data Center design and implementation
- Cloud evolution and migration-to-cloud strategies
- Enterprise Architecture

5 Xona Partners Strategic Advisory Team

Xona Partners is a boutique advisory services firm specialized in technology strategies, founded by a team of renowned technologists and startup founders, managing directors in global ventures, and investment advisors. Drawing on its founders' cross-functional expertise, Xona offers a unique multi-disciplinary, integrative technology and investment advisory service to private equity and venture funds, technology corporations, and regulators and public sector organizations. Xona's services bridge the widening gap between technology, finance and marketing silos by leveraging its founders' leadership experience. The services focus on two main areas:

1. Pre- and post-investment services
   - M&A due diligence (business and technical)
   - Investment advisory
   - Regulatory advisory
   - Full life-cycle management
2. Technology advisory services in the TMT sector
   - 4G/broadband and mobile communications
   - Big data infrastructure and analytics
   - Cloud platforms and applications

Xona Partners serves clients in the following segments:

1. Private Equity & Venture Funds: M&A due diligence; helping to understand prospective markets and technologies.
2. Technology Corporations: advising the investment and technology arms of banks, insurers, healthcare, automotive and other verticals, providing domain expertise.
3. Governments, Regulators & Policy Makers: providing unbiased assessment of markets and technologies; assisting in forming policy decisions.

Xona Partners has been particularly active in the areas of cloud enablement and IT transformation, advising clients globally on technical and business considerations, and partnering with world-class product and professional services teams for implementation and execution.

6 Proposal & Call for Partnership

Over the last 18 months we have successfully validated our approach in large-scale IT transformation projects involving designs with some of the most aggressive scaling, reliability and manageability deployment requirements. We, at N42 and Xona Partners, are now in the process of taking these in-house development and deployment methodologies to broader markets, and would welcome discussing specific requirements with key IT architects whose mission is to lead their IT transformation architectures, as well as with managed services players wanting to build on their existing IT and big data capabilities and augment them with specific cloud-based data management platforms.

We believe that the next generation of big data platform architectures will evolve in the direction we have been highlighting, and hence encourage the various players to speed up this evolution, in the common interest of the ecosystem. We would welcome further analysis and reflection around these fast-moving topics, with the goal of jointly designing and putting in place adequate IT and data transformation architectures and implementations.

Contact and Further Discussions

For more information, contact any of our team members: Gopal (dommety@networks42.com), Gopinath (rebala@networks42.com), Vineet (dhynai@networks42.com), Paddy (ramanathan@networks42.com) or Riad (riad@xonapartners.com). It would be a pleasure to meet for further discussions in the Silicon Valley area or at any of our global locations in Hong Kong, London and Bangalore.

Xona Partners: advisors@xonapartners.com

An N42 and Xona Partners Collaboration, 2013 White Paper


More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

How to Hadoop Without the Worry: Protecting Big Data at Scale

How to Hadoop Without the Worry: Protecting Big Data at Scale How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance

More information

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5 Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark

More information

Communicating with the Elephant in the Data Center

Communicating with the Elephant in the Data Center Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline

More information

BIG DATA TOOLS. Top 10 open source technologies for Big Data

BIG DATA TOOLS. Top 10 open source technologies for Big Data BIG DATA TOOLS Top 10 open source technologies for Big Data We are in an ever expanding marketplace!!! With shorter product lifecycles, evolving customer behavior and an economy that travels at the speed

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

IBM Big Data Platform

IBM Big Data Platform IBM Big Data Platform Turning big data into smarter decisions Stefan Söderlund. IBM kundarkitekt, Försvarsmakten Sesam vår-seminarie Big Data, Bigga byte kräver Pigga Hertz! May 16, 2013 By 2015, 80% of

More information

Three Open Blueprints For Big Data Success

Three Open Blueprints For Big Data Success White Paper: Three Open Blueprints For Big Data Success Featuring Pentaho s Open Data Integration Platform Inside: Leverage open framework and open source Kickstart your efforts with repeatable blueprints

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Simplifying Big Data Analytics: Unifying Batch and Stream Processing John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Streaming Analy.cs S S S Scale- up Database Data And Compute Grid

More information

Big Data Analytics for Space Exploration, Entrepreneurship and Policy Opportunities. Tiffani Crawford, PhD

Big Data Analytics for Space Exploration, Entrepreneurship and Policy Opportunities. Tiffani Crawford, PhD Big Analytics for Space Exploration, Entrepreneurship and Policy Opportunities Tiffani Crawford, PhD Big Analytics Characteristics Large quantities of many data types Structured Unstructured Human Machine

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information