and Brief Journey into the New World of Next Generation Data Infrastructures
IT Transformation Advisory Proposal

Dr. Gopal Dommety, Gopinath Rebala, Dr. Vineet Dhyani, Paddy Ramanathan (N42)
Dr. Riad Hartani (Xona Partners)

December 2013

1 Preamble

The last decade has seen the emergence of a large number of leading information-centric technology players that share one thing in common: without a reliable, large-scale data management infrastructure, their business would rapidly converge to zero. Leading players such as Google, Facebook, Amazon, Yahoo, Baidu, Alibaba and a select few know this all too well, and have accordingly made their infrastructure management a top priority. The next decade looks much the same, with one caveat: the top priority is now becoming the first priority. In fact, this same data and information technology (IT) infrastructure, which has evolved more or less linearly over the years, is soon to see significant architectural disruptions, with the goal of making it orders of magnitude more scalable, reliable, intelligent and manageable. This is exactly where N42, Inc. ("N42"), in partnership with Xona Partners, has set its sights, putting together a transformational data and information technology practice to address these upcoming challenges. In retrospect, looking back at the seminal "IT Doesn't Matter" paper (Harvard Business Review, circa 2003), we believe, and are ready to say, that from 2014 onwards IT matters more than ever.

Initiating our analysis from N42's Menlo Park headquarters in California's Silicon Valley and Xona Partners' San Francisco and Singapore headquarters, and leveraging the hands-on know-how of our teams in Hong Kong, Bangalore and London, we set sail on a journey through what we see as the data infrastructure evolution challenges, and highlight our evolving contributions aimed at overcoming them. This short positioning paper is presented as a baseline for follow-up detailed discussions on the various topics under consideration, with the goal of designing, customizing and optimizing our solutions to the needs of leading data-centric organizations, leveraging the broad and complementary expertise of our team. It builds specifically on N42's successful cloud and data transformation rollouts in the recent past, as well as Xona Partners' extensive IT transformation and cloud migration audits, strategies and implementations. The fundamental premise of the question we are addressing is fairly simple: as a business built around gathering large-scale data sets, mining and learning from that data, and optimizing communication between those producing it and those using it, where shall I go from here? The answer touches on the evolution of the compute platform on which the data sets sit, but also on aspects as varied as business intelligence, data warehousing, reliability and availability, performance and scalability, monitoring, metrics and diagnostics, real-time analytics, the Data API and governance, and additional requirements such as security.

1.1 Rationale for a data infrastructure evolution revisit

Most information technology players are, or will soon be, considering enhancing their data infrastructure to help build data solutions that optimize their ability to identify, capture and manage data, providing actionable, trusted insights that improve strategic and operational decision making and result in incremental revenue and a better customer experience. The current challenges of existing platforms mostly affect the operations teams' ability to provide reliable SLAs for business-critical reporting jobs, and the data platforms' ability to scale and perform effectively as the business grows across distributed geographies.

As such, the desired goal is to create a solid foundation architecture that provides these functional capabilities optimally, along with a platform on which to overlay additional applications such as business intelligence and a Data Science as a Service capability. For most players, an insider-only approach leveraging internal resources will only go so far in architecting and designing this new-generation architecture, given the breadth of expertise required and, more importantly, the need for a step-back, outside-the-box way of designing things. As such, our teams aim to be the strategic advisory partner that brings in such expertise, building open models tested and validated across a large set of data platform development models.

2 Data Infrastructure: The Road Ahead

Our approach to these IT transformation goals is based on our understanding of existing big data architectures and deployment models, on the assumptions we make about how data and information models are structured, and on the underlying performance and reliability requirements. Additionally, our approach aims at providing a capability maturity path that increases the capabilities of the system with minimal disruption to current operations, forming the basis for an evolutionary migration. As such, the architectural models we build our platforms on are designed to address some of the persistent problems in current infrastructure as well as the future data processing needs of the organization.

2.1 Current Data Infrastructure: a revisit

As of today, existing information infrastructure and analytics processes suffer from challenges we have observed and worked on in the most typical large-scale data and IT projects, some of which are listed below:

- Data quality issues: data consistency and completeness problems due to intake problems
- Stringent SLAs are not met for business-critical jobs
- Implementation and orchestration of jobs in Java/Perl is fairly unproductive
- Incomplete metering, monitoring and diagnostics of distributed software jobs and processes
- Increasing completion times for software jobs as ad-hoc usage increases
- Data architecture issues regarding naming, change management and running analytics
- Lack of comprehensive policy-based management for retention, replication, archival and compression
- Lack of incremental analysis with real-time data: reports and events are generated only from near-real-time data
- Lack of comprehensive reports with near-real-time granularity, as opposed to hours or days
- Need to consolidate multiple data centers into a combined cluster for data analysis

We believe that monitoring for usage patterns and setting policies for meeting acceptable SLAs, along with capacity planning, will give the system predictability and performance. A thorough process-to-resources mapping analysis, for a better understanding of system usage, will lead to novel architectures that allow costing and time estimates for meeting the required SLAs.

In the upcoming sections, we position our thesis on how these data architectures should evolve over the short, mid and long term. The analysis is presented in a generic fashion, but builds on selective and specific case studies that we have worked on, developed and completed in the real world, leveraging the methodologies, processes and tools we have designed. In other words, what is described is a pragmatic, successfully completed case study, albeit made generic to show broader applicability. We believe that leading IT data architects will relate to the assumptions, the models and the approach, and will see in them a template from which we can work, hand in hand, to evolve their IT and data architectures.

2.2 Data Infrastructure Optimization: the low-hanging fruit

Our approach takes the capability maturity path into account by introducing components into the system to address the pressing needs of reliability and predictability. Additionally, we introduce components to meet the SLAs of what we anticipate to be upcoming system-level jobs and to support future performance requirements. A representative architecture is presented here that supports predictable process completion and faster analytical query processing.

Components added to the system are described with their functionality in this section. The suggestions here assume that there is no major upgrade to the Hadoop infrastructure; component versions will be those compatible with the existing infrastructure.

- N42 Monitor
- Flume
- Oozie
- Hive on Spark
- Data API

This architecture, when translated into an operational mode, needs to be fine-tuned based on analysis of the actual existing infrastructure and operations. The analysis phase will include an evaluation of current infrastructure usage, failure characteristics and data architecture, leveraging in-house custom-designed tools and processes. The results will be evaluated by a team comprising data warehouse architects, system architects and domain-specific functional experts, resulting in capacity recommendations, data architecture recommendations and a transition/scaling plan that minimizes risk to critical business operations. These various aspects are briefly highlighted below.

Business Intelligence

The architecture components are designed to provide off-the-shelf analytical components that fit in with minimal integration work. There are a few data visualization and reporting applications providing dashboards, charting and spreadsheets; some of them are integrated with Hive and/or can work with JDBC or REST/JSON formats. Pentaho, Talend, Tableau, Datameer and Platfora are some of the available BI tool options. Platfora is a good choice, providing a good mix of integration with the Hadoop ecosystem and an easy-to-use frontend for data analysis. The Data API layer on top of the data warehousing infrastructure is a software layer, developed according to the requirements, that provides security and decouples third-party access from the data architecture.

Data Warehousing

Hive on Spark allows in-memory queries for analytics, providing near-real-time analysis of data as opposed to running Hive using MR batch processing on HDFS. This component addresses the SLA requirements of the reporting solution without having to re-implement the existing reports, while adding performance. Additionally, it provides better data import/export than the MongoDB NoSQL solution, with better performance at lower cost. Hive provides a better ability to run ad-hoc queries and supports SQL-style syntax for running analysis without having to write MR jobs in Java or Perl. It also provides horizontal scalability on the existing cluster without additional data transfers in and out of the cluster.
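To make the ad-hoc query point concrete, the minimal sketch below runs a HiveQL aggregation through the standard HiveServer2 JDBC driver instead of a hand-written MR job. The host, credentials, table and column names are illustrative placeholders, not taken from any specific deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: an ad-hoc analytical query submitted through HiveServer2's JDBC
// interface rather than a Java or Perl MapReduce job. Connection details are placeholders.
public class AdHocHiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {
            // SQL-style syntax; the execution engine behind it is whatever Hive is configured to use.
            ResultSet rs = stmt.executeQuery(
                "SELECT event_date, COUNT(*) AS events "
              + "FROM tracker_logs WHERE event_date >= '2013-12-01' "
              + "GROUP BY event_date ORDER BY event_date");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

The same query can be pointed at HiveServer2 regardless of how the execution engine underneath evolves, which is what makes this a low-disruption path for existing reporting jobs.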

Reliability and Availability

The custom-designed N42 SLA and policy-monitoring tool provides visibility into job/workflow performance and guidance for performance improvement, and, integrated with Ganglia, provides a comprehensive dashboard. This product allows job/workflow-level monitoring of cluster usage and brings predictability to job completion times, improving the reliability of the system.

Flume provides a reliable way of aggregating log files from the various web servers as well as streaming in tracker and audience data. Flume provides spooling directories, a JMS source and a plugin capability that can be used to customize Flume data ingest to detect missing expected data. Custom plugins can also be used to perform some client-side cleaning before ingestion into Hadoop, with a possible reduction in the amount of data ingested. Oozie provides an event-based workflow mechanism for launching jobs on data ingest into HDFS or HCatalog. Additionally, Oozie provides an easy way of specifying job workflows, including Pig and Hive jobs, and allows SLA specification for a workflow. These components will provide critical data reliability for running reports, address data quality issues, and handle the late-data problem effectively, even if dependencies on external systems are not completely eliminated.

Performance and Scalability

Performance and scalability improvements are achieved using Hive on Spark as well as the N42 custom-designed platform for monitoring and analysis. Linear scalability and performance at scale can be achieved by using the Hadoop 2.0 architectures. Hive/Spark, along with proper capacity allocation processes, will resolve issues with reliable reports and the ability to rerun reports, and will provide better ad-hoc query performance.

Monitoring, Metrics and Diagnostics

N42 tools provide a dashboard for comprehensive monitoring of the cluster, using the Ganglia monitoring system or an existing monitoring system, along with job profiling and analysis. This in turn brings predictability to job completion times based on job profiles, which provides excellent diagnostic capability for job performance. Cluster problems such as out-of-memory conditions, disk issues or job failures can be detected effectively and addressed proactively.

Data API

The Data API is a virtualization layer that hides the underlying platform details and provides REST or JDBC interfaces for external interaction. There are no on-premise Data API solution providers; some providers, such as Qubole, offer Data API services under a SaaS model. Some solutions offer simplified data import/export to NoSQL databases. These solutions can be integrated to provide a consistent data view to external actors.
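As an illustration of the Data API idea, the minimal sketch below exposes a single REST endpoint that hides the warehouse layout behind a stable JSON contract, using only the JDK's built-in HTTP server. The endpoint name and payload are hypothetical; a real implementation would delegate to the warehouse (for example through the Hive JDBC interface shown earlier) and add authentication and authorization.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

// Minimal sketch of a REST-style Data API facade. External consumers see a stable
// JSON contract; the underlying tables and engines stay hidden behind this layer.
// The /reports/daily endpoint and its canned payload are illustrative only.
public class DataApiServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/reports/daily", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                // In a real deployment this would query the warehouse; here it returns canned JSON.
                byte[] body = "{\"date\":\"2013-12-01\",\"events\":123456}"
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
        });
        server.start();
        System.out.println("Data API listening on http://localhost:8080/reports/daily");
    }
}
```

Because clients code against this interface rather than the storage layer, the warehouse can later move from MR to Spark or from Hive to HBase without breaking external integrations.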

2.3 Proposed Long Term Data Infrastructure

A longer-term proposal is to migrate to YARN-based Hadoop MR processing, along with improved real-time analytics for event processing using Storm, as illustrated below. This proposed architecture adds the following features to the existing system:

- Improved data ingest process for achieving high confidence and consistency in data analysis, even in the face of variable data loads (see the ingest plugin sketch after this list)
- Policy and SLA specification for workflows
- Visibility into job execution and cluster state, and the capability to debug jobs
- Faster ad-hoc query processing for analysis
- Data API for faster integration with BI tools
- Real-time data analysis and reports
- High availability with fine-grained data governance
- Improved data lifecycle management
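The improved ingest process relies on Flume with custom plugins for client-side cleaning and missing-data detection, as described in sections 2.2 and 2.3. The sketch below shows one such plugin as a standard Flume interceptor; the record layout and field-count rule are hypothetical, purely to illustrate the extension point.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Minimal sketch of a custom Flume interceptor for client-side cleaning before
// ingest into Hadoop: it drops empty or truncated tracker records so they never
// reach HDFS. The EXPECTED_FIELDS rule is a placeholder, not from the proposal.
public class TrackerRecordCleaner implements Interceptor {

    private static final int EXPECTED_FIELDS = 12;  // hypothetical record layout

    @Override
    public void initialize() { /* no state to set up */ }

    @Override
    public Event intercept(Event event) {
        String line = new String(event.getBody());
        // Returning null tells Flume to discard the event.
        if (line.trim().isEmpty() || line.split("\t", -1).length < EXPECTED_FIELDS) {
            return null;
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<Event>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() { /* nothing to release */ }

    // Flume instantiates interceptors through a Builder named in the agent configuration.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new TrackerRecordCleaner(); }

        @Override
        public void configure(Context context) { /* no custom properties in this sketch */ }
    }
}
```

A counter of dropped events in such a plugin is also a convenient hook for the missing-data detection mentioned above.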

Components added to the system are described with their functionality in this section. We recommend using the latest versions of the distributions from Cloudera or Hortonworks for integrated components, better maintenance and performance improvements. One illustrative deployment is described below.

- N42 Think
- Hadoop
- Flume
- Kafka
- Sqoop
- Oozie
- Pig
- Spark
- Hive
- HBase
- Storm
- Hue
- Zookeeper
- HCatalog

Business Intelligence

The architecture components are designed to provide off-the-shelf analytical components that fit in with minimal integration work, as described in the short-term architecture. Platfora is a good choice, providing a good mix of integration with the Hadoop ecosystem and an easy-to-use frontend for data analysis, as mentioned in the earlier section. Platfora also supports heat maps, charts and drilldowns for publishers.

Data Warehousing

HBase is very efficient at fast time-range scanning, time-range queries, data drilldown and similar operations on read-mostly data with low-throughput writes. Additionally, HBase supports quick snapshots and is an ideal data warehousing platform. Data cubes stored in HBase allow cube operations such as pivoting and drilldown via HBase. HBase is a good data warehousing option in terms of cost/performance for report generation. Hive on Spark allows in-memory queries for analytics, providing near-real-time analysis of data as opposed to running Hive using MR batch processing on HDFS. This component addresses the SLA requirements of the reporting solution without having to re-implement the existing reports, while adding performance. Additionally, it provides better data import/export than the MongoDB NoSQL solution, with better performance at lower cost.
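A minimal sketch of the time-range scanning pattern mentioned above, using the classic HBase Java client; the table, column family and qualifier names are placeholders rather than part of the proposed schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch: scan the last hour of cells in a report cube table, letting
// HBase filter on cell timestamps server-side. Names are illustrative only.
public class HourlyDrilldown {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "report_cube");
        try {
            long end = System.currentTimeMillis();
            long start = end - 60L * 60L * 1000L;   // last hour of cell versions
            Scan scan = new Scan();
            scan.setTimeRange(start, end);          // server-side timestamp filtering
            scan.setCaching(500);                   // batch rows per RPC for scan throughput
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("m"), Bytes.toBytes("views"));
                    if (value != null) {
                        System.out.println(Bytes.toString(row.getRow()) + "\t" + Bytes.toLong(value));
                    }
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```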

Hive on HBase supports the ability to rerun batch queries, faster performance for ad-hoc queries, 30-minute resolution reports, and a higher-dimensional data store for better filtering and analysis of data with an SQL query language.

Reliability and Availability

Hadoop 2.0 supports a distributed JobTracker and high availability for HDFS, avoiding a single point of failure in the Hadoop deployment. HDFS replication itself lends high availability to data on the file system. ZooKeeper should be deployed with three nodes for high availability of the cluster. Data policies for archiving and snapshots through HBase will provide reliability and disaster recovery options for the cluster. The N42-designed SLA and policy-monitoring tools provide visibility into job/workflow performance and guidance for performance improvement, and, integrated with Ganglia, provide a comprehensive dashboard.

Data ingest is one of the critical first steps in achieving data consistency for analysis. Data is received from a large number of sources for the tracking and audience back ends. A process needs to be in place to detect missing data from expected sources and to ingest data into HDFS efficiently, with error detection and optimization. Flume allows predictable and efficient data ingestion into the HDFS file system, providing visibility into failures and improving the performance of data ingestion. Missing data can be detected with custom plugins in the Flume pipeline. Depending on the requirements, it is possible to use Kafka in the pipeline for reliable delivery of data and prevention of data loss. Kafka addresses the following scenarios:

- Loss prevention for spikes in logger and audience data that exceed the capacity for HDFS ingest
- MR job failures at the queue sink, or any other failures in the pipeline

Oozie provides an event-based workflow mechanism for launching jobs on data ingest into HDFS or HCatalog. Additionally, Oozie provides an easy way of specifying job workflows, including Pig and Hive jobs, and allows SLA specification for a workflow. This implementation will allow better quality of data for reliable reports and better performance on scheduled reports as well as ad-hoc queries.

Performance and Scalability

Cloudera deployments are installed with the Fair Scheduler as the default option. For achieving high throughput for the jobs, we recommend the Capacity Scheduler as the scheduler of choice. Performance and scalability improvements are achieved using Hive on Spark as well as the N42 platform for monitoring and analysis. Linear scalability and performance at scale can be achieved by using the Hadoop 2.0 architecture, as defined in the next section.
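To make the Oozie-based orchestration above concrete, the sketch below submits a workflow (for example, a Pig cleanup step followed by a Hive report) through Oozie's Java client and polls its status. The Oozie URL, HDFS paths and property values are placeholders, not taken from any specific deployment.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Minimal sketch: launch and track an Oozie workflow from Java. The workflow.xml
// referenced by APP_PATH is assumed to chain the Pig and Hive actions for a daily report.
public class LaunchDailyReportWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Path to the workflow definition in HDFS (placeholder).
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode.example.com:8020/user/etl/workflows/daily-report");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");
        conf.setProperty("queueName", "reporting");

        String jobId = oozie.run(conf);      // submit and start the workflow
        Thread.sleep(10 * 1000L);            // give it a moment before polling
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}
```

In practice the same workflow would usually be triggered by an Oozie coordinator watching for the day's data landing in HDFS or HCatalog, which is the event-based mechanism described above.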

This cluster is designed to be a single cluster supporting the data needs of the business, able to consolidate all or some of its data centers. Processes and policies for data lifecycle management, covering archiving, retention, compression and replication, will allow efficient data management with low overhead costs.

Monitoring, Metrics and Diagnostics

The N42 platform provides a dashboard for comprehensive monitoring of the cluster, using the Ganglia monitoring system or an existing monitoring system, along with job profiling and analysis. It brings predictability to job completion times based on job profiles, which provides excellent diagnostic capability for job performance. Fine-grained estimations of cluster usage by user, job type and time of day will allow better policies for cluster usage and planning. The diagnostic insights lead to higher-performing jobs, better data design and fewer failures in the cluster.

Data API

The Data API is a virtualization layer that hides the underlying platform details and provides REST or JDBC interfaces for external interaction. There are no on-premise Data API solution providers; some providers, such as Qubole, offer Data API services under a SaaS model. Some solutions offer simplified data import/export to NoSQL databases. These solutions can be integrated to provide a consistent data view to external actors. Developing applications against data interfaces that are decoupled from the data storage structure will lead to lower maintenance costs and better integration with partners.

Real-time analytics

Real-time data processing has to accommodate high-velocity data streams and process data in near real time for alerts and analysis. A real-time processing system using Storm and Kafka will support the following functionality:

- Horizontally scalable ingest
- Ability to process large-scale event volumes (millions/sec)
- Reliable processing (no data loss)
- Archival of processed events for further temporal analysis
- Plugin support for event processing

Storm supports high-throughput event processing and achieves reliability by using Kafka for incoming data. Processed events can generate further events that are acted upon in real time by additional jobs. The processed data is persisted in HBase for efficient storage and can be combined with historical data in the cluster for generating reports at short 30-minute intervals. This real-time processing infrastructure will support mobile reports that are expected to be generated in near real time, which in itself is a significant source of intrinsic business value.
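A minimal sketch of the Storm piece of this pipeline follows. For brevity the spout generates synthetic events; in the proposed architecture it would be replaced by a Kafka-fed spout, and the counting bolt would persist rolling aggregates to HBase rather than holding them in memory. Names and parallelism settings are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal sketch of a Storm topology for the real-time analytics layer described above.
public class RealtimeEventTopology {

    // Stand-in for the Kafka-fed stream of tracker/audience events.
    public static class RandomEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("event-type-" + random.nextInt(5)));
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // Keeps an in-memory running count per event type; a real bolt would flush to HBase.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String event = input.getStringByField("event");
            Long current = counts.get(event);
            counts.put(event, current == null ? 1L : current + 1L);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt in this sketch: nothing is emitted downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new RandomEventSpout(), 1);
        builder.setBolt("counts", new CountBolt(), 2).fieldsGrouping("events", new Fields("event"));

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();   // StormSubmitter would be used on a real cluster
        cluster.submitTopology("realtime-events", conf, builder.createTopology());
        Thread.sleep(30 * 1000L);                    // let the local topology run briefly
        cluster.shutdown();
    }
}
```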

3 Advisory and Professional Services Offering

Our partnership (as Xona Partners and the N42 team) with your IT strategy and delivery teams would be based on the following advisory and professional services:

- Big Data Strategy Services: deciding on the right big data strategy, transitioning from PoC to production, and integrating big data with BI solutions
- Big Data Design and Deployment Services: designing/dimensioning Hadoop/Cassandra clusters, troubleshooting/optimization of Hadoop and Cassandra, data center design (physical design) plus network design for big data (Hadoop and Cassandra), and consulting on Pig, Hive, Sqoop, Flume, HBase, Spark, Storm, etc.
- Data Science Services: custom Map/Reduce development, machine learning/data science on Hadoop (using Mahout or RHadoop), Hive/HBase development, and building analytics applications on Hadoop/HBase/Cassandra/MongoDB
- Big Data Operations Services: reliability and uptime assessment; security and governance (policy, compliance, etc.); monitoring, alerting and remediation (including run books)

Post initial analysis, the implementation teams would work jointly on executing the following tasks:

- Job Analytics: provides the ability to analyze jobs, their current SLAs, failures, reasons for failures and SLA violations. For instance, job failures due to infrastructure choke points and data skews are identified.
- Job performance profiling: provides a detailed view of the internal structure of a job's execution, covering each task in the job, the different phases of the job (e.g., shuffle) and the performance of tasks within the job.
- Cluster Analytics: provides aggregate cluster usage as well as the per-job slice of cluster usage.
- Job Profiling: provides an analysis of all previous runs of a job along with diagnostic information.
- Workflow Analytics: detailed performance for the entire workflow as well as for individual jobs and tasks of the workflow.
- Visibility into different types of jobs (ETL, native MR, Pig, HBase, etc.):
  - Pig and/or Scalding jobs for file ingest
  - Sqoop jobs for relational DB ingest
  - Camus jobs for Kafka message queue ingest
  - Storm bolts for ingest from Storm clusters
- SLA and Policy Analytics: analysis based on user-defined SLAs and policies.

4 The N42 Multi-disciplinary Lead Team

A few words about the N42 team, to better explain our perspective and our thesis on how to approach these fundamental technical and business challenges. N42 has developed deep expertise and know-how in big data and has experienced, first hand, the various challenges associated with building and operating large-scale big data infrastructures. Our solutions help customers monitor and analyze their big data infrastructure and ensure, through SLA and policy definitions, that it operates to meet the business needs, something that is, and will become even more, central to any data-centric business.

Team N42 has extensive experience in creating infrastructures and applications that can receive live upgrades. We understand the operational complexity of major data transformations and application upgrades. We perform incremental upgrades with minimal or no downtime of services, using a prep, update, switch and stabilize approach. N42 has experience doing this in SaaS-based healthcare, helpdesk, online recruitment, retail and distributed manufacturing applications.

The analysis phase will include an evaluation of current infrastructure usage, failure characteristics and data architecture with the help of N42 tools. The results will be evaluated by a team comprising a data warehouse architect, a system architect and an ad-tech functional expert, resulting in capacity recommendations, data architecture recommendations and a transition/scaling plan that minimizes risk to critical business operations. The team has demonstrated the following in previous projects:

- Deep and broad experience in the data management domain, in both front-end and back-end infrastructure solutions
- Strong expertise in data science, statistical analysis and machine learning in the context of any industry vertical
- Extensive experience in architecting and implementing big data for data-heavy businesses, including one of the largest social media networks
- Detailed experience with the implementation and management of large big data installations, as well as the underlying network and IT infrastructure
- Strong project management and delivery expertise in delivering complex projects with geographically dispersed teams

More specifically, N42 has expertise and offers services in designing and deploying large-scale, complex projects in the areas of:

- Big Data and Data Warehousing
- Data Science and the Ad-Tech domain
- Data Center design and implementation
- Cloud evolution and migration-to-cloud strategies
- Enterprise Architecture

5 Xona Partners Strategic Advisory Team

Xona Partners is a boutique advisory services firm specialized in technology strategies, founded by a team of renowned technologists and startup founders, managing directors in global ventures, and investment advisors. Drawing on its founders' cross-functional expertise, Xona offers a unique multi-disciplinary, integrative technology and investment advisory service to private equity and venture funds, technology corporations, and regulators and public sector organizations. Xona's services bridge the widening gap between technology, finance and marketing silos by leveraging its founders' leadership experience. The services focus on two main areas:

1. Pre- and post-investment services
   - M&A due diligence (business and technical)
   - Investment advisory
   - Regulatory advisory
   - Full life-cycle management
2. Technology advisory services in the TMT sector
   - 4G/broadband and mobile communications
   - Big data infrastructure and analytics
   - Cloud platforms and applications

Xona Partners serves clients in the following segments:

1. Private Equity & Venture Funds: M&A due diligence; helping to understand prospective markets and technologies.
2. Technology Corporations: advising the investment and technology arms of banks, insurers, healthcare, automotive and other verticals, providing domain expertise.
3. Governments, Regulators & Policy Makers: providing unbiased assessment of markets and technologies; assisting in forming policy decisions.

Xona Partners has been particularly active in the areas of cloud enablement and IT transformation, advising clients globally on technical and business considerations, and partnering with world-class product and professional services teams for implementation and execution.

6 Proposal & Call for Partnership

Over the last 18 months we have successfully validated our approach in large-scale IT transformation projects involving designs with some of the most aggressive scaling, reliability and manageability deployment requirements. We, at N42 and Xona Partners, are now in the process of taking these in-house development and deployment methodologies to broader markets, and would welcome discussing specific requirements with key IT architects whose mission is to lead their IT transformation architectures, as well as with managed services players wanting to build on their existing IT and big data capabilities and augment them with specific cloud-based data management platforms.

We believe that the next generation of big data platform architectures will evolve in the direction we have been highlighting, and hence encourage the various players to speed up this evolution, in the common interest of the ecosystem. We would welcome further analysis and reflection around these fast-moving topics, with the goal of jointly designing and putting in place adequate IT and data transformation architectures and implementations.

Contact and Further Discussions

For more information, contact any of our team members: Gopal (dommety@networks42.com), Gopinath (rebala@networks42.com), Vineet (dhynai@networks42.com), Paddy (ramanathan@networks42.com) or Riad (riad@xonapartners.com). It would be a pleasure to meet for further discussions in the Silicon Valley area or at any of our global locations in Hong Kong, London and Bangalore.

Xona Partners: advisors@xonapartners.com

An N42 and Xona Partners Collaboration, 2013 White Paper


More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

How to Hadoop Without the Worry: Protecting Big Data at Scale

How to Hadoop Without the Worry: Protecting Big Data at Scale How to Hadoop Without the Worry: Protecting Big Data at Scale SESSION ID: CDS-W06 Davi Ottenheimer Senior Director of Trust EMC Corporation @daviottenheimer Big Data Trust. Redefined Transparency Relevance

More information

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5 Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark

More information

Communicating with the Elephant in the Data Center

Communicating with the Elephant in the Data Center Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline

More information

BIG DATA TOOLS. Top 10 open source technologies for Big Data

BIG DATA TOOLS. Top 10 open source technologies for Big Data BIG DATA TOOLS Top 10 open source technologies for Big Data We are in an ever expanding marketplace!!! With shorter product lifecycles, evolving customer behavior and an economy that travels at the speed

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

IBM Big Data Platform

IBM Big Data Platform IBM Big Data Platform Turning big data into smarter decisions Stefan Söderlund. IBM kundarkitekt, Försvarsmakten Sesam vår-seminarie Big Data, Bigga byte kräver Pigga Hertz! May 16, 2013 By 2015, 80% of

More information

Three Open Blueprints For Big Data Success

Three Open Blueprints For Big Data Success White Paper: Three Open Blueprints For Big Data Success Featuring Pentaho s Open Data Integration Platform Inside: Leverage open framework and open source Kickstart your efforts with repeatable blueprints

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Simplifying Big Data Analytics: Unifying Batch and Stream Processing John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Streaming Analy.cs S S S Scale- up Database Data And Compute Grid

More information

Big Data Analytics for Space Exploration, Entrepreneurship and Policy Opportunities. Tiffani Crawford, PhD

Big Data Analytics for Space Exploration, Entrepreneurship and Policy Opportunities. Tiffani Crawford, PhD Big Analytics for Space Exploration, Entrepreneurship and Policy Opportunities Tiffani Crawford, PhD Big Analytics Characteristics Large quantities of many data types Structured Unstructured Human Machine

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information